Hi

Option #1 is also good: the question of how to pass a ContentHandler to a Beam function was open, and given that passing a TikaConfig is needed anyway, having a way to specify a handler there would be handy too...

Cheers, Sergey

On 28/09/17 22:17, Chris Mattmann wrote:
I am +1 for this. Option #2 sounds to me like a slick way to handle this that would
remain back compat with tika-python, which is of strong interest to me.

Cheers,
Chris




On 9/28/17, 1:35 PM, "Giuseppe Totaro" <[email protected]> wrote:

     Hi folks,
     if I am not wrong, you currently cannot configure a specific ContentHandler
     while using tika-server. I mean that you can configure your own parser [0]
     but you cannot control which ContentHandler the parser uses to extract
     text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
     StandardsExtractingContentHandler, etc.).
     If that is correct, it would be nice to enable the use of specific
     ContentHandlers within tika-server, and I would like to discuss how to solve
     this issue in general.
     I propose two solutions:
        1. augment the TikaConfig class so that a specific ContentHandler can be
           specified in tika-config.xml;
        2. determine the ContentHandler to use for parsing through HTTP headers,
           for example:
           curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"
           This would also affect the TikaResource.java class.
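
As a minimal sketch of what option 2 could look like from a Python client, the request from the curl example can be built as shown here. The X-Content-Handler header is only the proposal made in this thread, not something tika-server is known to accept today, and build_meta_request is a hypothetical helper name. The request is built but not sent, so no running server is needed.

```python
import urllib.request

# Endpoint taken from the curl example in this thread.
TIKA_META_URL = "http://localhost:9998/meta"


def build_meta_request(payload: bytes, handler_name: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that names a ContentHandler.

    `curl -T` uploads with PUT, so the same method is used here.
    """
    return urllib.request.Request(
        TIKA_META_URL,
        data=payload,
        method="PUT",
        headers={
            # Hypothetical header from option 2 of the proposal.
            "X-Content-Handler": handler_name,
        },
    )


request = build_meta_request(b"example document bytes", "PhoneExtractingContentHandler")

# urllib stores header names via str.capitalize(), hence the lookup key below;
# HTTP header names are case-insensitive on the wire.
print(request.get_method(), request.full_url)
print(request.get_header("X-content-handler"))
```

Because the handler is chosen per request, existing clients that do not send the header would keep their current behaviour, which is the back-compat property option 2 relies on.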
     I look forward to your feedback. I strongly believe that every user
     who wants to use Tika as a service through tika-server and needs to extract
     content and metadata such as phone numbers, standard references, etc. would be
     very happy with this.
     Thanks a lot,
     Giuseppe
