Hmm, cool.

Can we support both? If I don’t have to modify/ship a Tika config (which is a
runtime configuration) and I can change the ContentHandler on a per-call basis,
it would be MUCH easier in downstream libraries like Tika Python that rely on
the REST server. These are documented here:

https://wiki.apache.org/tika/API%20Bindings%20for%20Tika
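
To make the per-call idea concrete, here is a minimal Python sketch of what a
client could do. It assumes the X-Content-Handler header from option #2, which
is only a proposal in this thread and not something tika-server is known to
honor. The names build_meta_request and HANDLER_HEADER are hypothetical, and
the function only builds the request without sending it.

```python
import urllib.request

# Assumption: header name proposed as option #2 in this thread.
# It is not a confirmed tika-server feature.
HANDLER_HEADER = "X-Content-Handler"


def build_meta_request(data, handler=None, server="http://localhost:9998"):
    """Build (but do not send) a PUT request for tika-server's /meta endpoint.

    `handler` is the simple class name of the ContentHandler to request for
    this one call, e.g. "PhoneExtractingContentHandler". When it is None, no
    extra header is added and the server would use its default behavior.
    """
    request = urllib.request.Request(server + "/meta", data=data, method="PUT")
    if handler is not None:
        request.add_header(HANDLER_HEADER, handler)
    return request


# Two calls against the same server, each choosing its own handler.
phone_request = build_meta_request(b"%PDF-1.4 ...", "PhoneExtractingContentHandler")
default_request = build_meta_request(b"%PDF-1.4 ...")
```

Passing either request to urllib.request.urlopen() would send it to a running
server. The point is that nothing in tika-config.xml has to change between
the two calls.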

Cheers,
Chris




On 9/28/17, 2:26 PM, "Sergey Beryozkin" <[email protected]> wrote:

    Hi
    
    Option #1 is also good - the question of how to pass a ContentHandler to a
    Beam function was still open, and given that passing TikaConfig is needed
    anyway, having a way to specify a handler there can be handy too...
    
    Cheers, Sergey
    On 28/09/17 22:17, Chris Mattmann wrote:
    > I am +1 for this. Option #2 sounds to me like a slick way to handle this
    > that would remain backward compatible with tika-python, which is of
    > strong interest to me.
    > 
    > Cheers,
    > Chris
    > 
    > 
    > 
    > 
    > On 9/28/17, 1:35 PM, "Giuseppe Totaro" <[email protected]> wrote:
    > 
    >      Hi folks,
    >      
    >      if I am not mistaken, you currently cannot configure a specific
    >      ContentHandler while using tika-server. I mean that you can configure
    >      your own parser [0] but you cannot control which ContentHandler the
    >      parser uses to extract text and metadata (e.g., you cannot use
    >      PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc.).
    >      If that is correct, it would be nice to enable the use of specific
    >      ContentHandlers within tika-server, and I would like to discuss how to
    >      solve this issue generally.
    >      
    >      I propose two solutions:
    >      
    >         1. augment the TikaConfig class so that a specific ContentHandler
    >         can be used in tika-config.xml;
    >         2. determine the ContentHandler to use for parsing through HTTP
    >         headers, for example:
    >         curl -T filename.pdf http://localhost:9998/meta --header \
    >         "X-Content-Handler: PhoneExtractingContentHandler"
    >         This would also affect the TikaResource.java class.
    >      
    >      I look forward to your feedback. I strongly believe that every user
    >      who wants to use Tika as a service through tika-server and needs to
    >      extract content and metadata like phone numbers, standard references,
    >      etc. would be very happy.
    >      
    >      Thanks a lot,
    >      Giuseppe
    >      
    > 
    > 
    

