Hi Chris

Another option (for Beam) was passing a custom content handler via a serialized lambda expression, which sounds like black magic to me at the moment, but I'm curious :-)

I thought that, assuming TikaConfig is only used once to bootstrap, passing a ContentHandler class name there might work. You are right, the headers are better suited to the dynamic cases...
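
For the dynamic (header) case, here is a minimal sketch of how a header value could be mapped to a handler. It is only an illustration under assumptions: HandlerRegistry, TaggingHandler and resolve() are hypothetical names, nothing here is existing tika-server code, and TaggingHandler is a stand-in for a decorating handler such as PhoneExtractingContentHandler (whose real constructor arguments are not reproduced here). It uses only the JDK's SAX types.

```java
import java.util.Map;
import java.util.function.UnaryOperator;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: choose a ContentHandler decorator from a request header value. */
final class HandlerRegistry {

    /** Stand-in for a decorating handler; NOT a real Tika class. */
    static final class TaggingHandler extends DefaultHandler {
        final ContentHandler delegate;

        TaggingHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }
    }

    // Allowlist of accepted header values, so a client cannot make the
    // server instantiate an arbitrary class by name.
    private static final Map<String, UnaryOperator<ContentHandler>> DECORATORS =
            Map.of("TaggingHandler", TaggingHandler::new);

    /** Returns the base handler unchanged when the header is absent or unknown. */
    static ContentHandler resolve(String headerValue, ContentHandler base) {
        if (headerValue == null) {
            return base;
        }
        UnaryOperator<ContentHandler> decorator = DECORATORS.get(headerValue.trim());
        return decorator == null ? base : decorator.apply(base);
    }
}
```

A resource method would then call something like HandlerRegistry.resolve(headerValue, defaultHandler) once per request, with headerValue read from the proposed X-Content-Handler header. The same allowlist could also back a class name given in the config for the bootstrap case.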

Cheers, Sergey
On 28/09/17 22:35, Chris Mattmann wrote:
Hmm, cool.

Can we support both? If I don’t have to modify/ship a Tika config (which is a runtime
configuration) and I can, on a per-call basis, change the ContentHandler, it would
be MUCH easier in downstream libraries like Tika Python that rely on the REST server.
These are documented here:

https://wiki.apache.org/tika/API%20Bindings%20for%20Tika

Cheers,
Chris




On 9/28/17, 2:26 PM, "Sergey Beryozkin" <[email protected]> wrote:

     Hi

     Option #1 is also good - the question of how to pass a ContentHandler to a
     Beam function was open, and given that passing TikaConfig is needed
     anyway, having a way to specify a handler there can be handy too...

     Cheers, Sergey
     On 28/09/17 22:17, Chris Mattmann wrote:
     > I am +1 for this. Option #2 sounds like a slick way to handle this for me that would
     > remain back compat with tika-python which is of strong interest to me.
     >
     > Cheers,
     > Chris
     >
     >
     >
     >
     > On 9/28/17, 1:35 PM, "Giuseppe Totaro" <[email protected]> wrote:
     >
     >      Hi folks,
     >
     >      if I am not wrong, currently you cannot configure a specific ContentHandler
     >      while using tika-server. I mean that you can configure your own parser [0]
     >      but you cannot control which ContentHandler the parser leverages to extract
     >      text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
     >      StandardsExtractingContentHandler, etc).
     >      If it is correct, it would be nice to enable the use of specific
     >      ContentHandlers within tika-server and I would like to discuss how to solve
     >      this issue generally.
     >
     >      I propose two solutions:
     >
     >         1. augment the TikaConfig class so that a specific ContentHandler can be
     >         used in tika-config.xml;
     >         2. determine the ContentHandler to use for parsing through HTTP headers,
     >         for example:
     >         curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"
     >         This should affect also the TikaResource.java class.
     >
     >      I look forward to having your feedback. I strongly believe that every user
     >      who wants to use Tika as a service through tika-server and needs to extract
     >      content and metadata like phone numbers, standard references, etc would be
     >      very happy.
     >
     >      Thanks a lot,
     >      Giuseppe
     >
     >
     >
