Hi folks,

First of all, thank you for your feedback and insightful suggestions.

To sum up, I would like to quickly discuss the following aspects:

   - As you all mentioned, HTTP headers are better suited to the dynamic
   case, where the ContentHandler is chosen per request. Specifically, a
   ContentHandler can be named through an ad-hoc header, e.g.
   -H "X-Content-Handler: StandardsExtractingContentHandler", which is parsed
   and applied at run time within tika-server.
   - Nick, I believe that selecting the ContentHandler through a
   command-line option is a great idea. It would probably also be simpler
   for users who want one handler for the whole server instance.
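
To make the two options concrete, here is a minimal sketch of how a handler
name could be resolved, whether it arrives as a header value or as a
command-line argument. It uses only the JDK. The class ContentHandlerNames
and its methods are hypothetical, and the --content-handler flag is Nick's
proposal, not an existing tika-server option. I am also assuming that short
names default to the org.apache.tika.sax package.

```java
import java.util.Optional;
import org.xml.sax.ContentHandler;

// Hypothetical helper, not part of Tika: turns a user-supplied handler name
// into a fully qualified class name and checks that it is a ContentHandler.
final class ContentHandlerNames {
    // Assumption: short names live in Tika's SAX package.
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax.";
    // Proposed in this thread; not an existing tika-server flag.
    static final String CLI_FLAG = "--content-handler";

    // "PhoneExtractingContentHandler" -> "org.apache.tika.sax.PhoneExtractingContentHandler"
    // "com.example.CustomHandler"     -> unchanged
    static String toClassName(String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Empty ContentHandler name");
        }
        String name = value.trim();
        return name.indexOf('.') >= 0 ? name : DEFAULT_PACKAGE + name;
    }

    // Looks for "--content-handler <name>" among the server arguments.
    static Optional<String> fromArgs(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (CLI_FLAG.equals(args[i])) {
                return Optional.of(toClassName(args[i + 1]));
            }
        }
        return Optional.empty();
    }

    // Loads the class and rejects anything that is not a SAX ContentHandler.
    static Class<? extends ContentHandler> load(String className) {
        try {
            return Class.forName(className).asSubclass(ContentHandler.class);
        } catch (ClassNotFoundException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Not a usable ContentHandler: " + className, e);
        }
    }
}
```

The same toClassName call would serve the header path: the value of
X-Content-Handler is passed in as-is. Instantiation is left out on purpose,
since handlers that wrap another ContentHandler need constructor arguments
that only the server can supply.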

Let me implement both solutions and provide an example in the next few
days that we can discuss.

Thanks again for your time,
Giuseppe


On Thu, Sep 28, 2017 at 10:08 PM, Nick Burch <[email protected]> wrote:

> On Thu, 28 Sep 2017, Giuseppe Totaro wrote:
>
>> if I am not wrong, currently you cannot configure a specific
>> ContentHandler
>> while using tika-server. I mean that you can configure your own parser [0]
>> but you cannot control which ContentHandler the parser leverages to
>> extract
>> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
>> StandardsExtractingContentHandler, etc).
>>
>
> I think the long-term plan was to work out a viable plan for laying
> multiple parsers on top of each other, then change some of these to be
> "enhancing parsers" on top. However, that's still on the "TODO" list for
> Tika 2.0, as we've still yet to come up with a good way to allow it to
> happen within the SAX / ContentHandler structure
>
>
>> I propose two solutions:
>>
>>   1. augment the TikaConfig class so that a specific ContentHandler can be
>>   used in tika-config.xml;
>>
>
> That feels a bit wrong to me, because in almost all Tika use-cases, the
> value from the Config would be ignored.
>
> Trying to explain to a new user which were the cases where it'd be used,
> and which ones it was ignored, seems hard and confusing too...
>
>
>>   2. determine the ContentHandler to use for parsing through HTTP headers,
>>   for example:
>>
>
> We do allow setting of parser config via headers, so this would have
> precedent. It would also allow per-request changing.
>
> Otherwise, if server-wide is OK (which your config idea would require
> anyway), might it not be better to make it an option when you start the
> server? I see it as being a bit more like picking a port, in terms of
> something specific to how you run that server instance
>
> eg java -jar tika-server.jar --port 1234 --content-handler PhoneExtractingContentHandler
> eg java -jar tika-server.jar --port 1234 --content-handler com.example.CustomHandler
>
> Nick
>
