On Thu, 28 Sep 2017, Giuseppe Totaro wrote:
if I am not wrong, currently you cannot configure a specific ContentHandler
while using tika-server. I mean that you can configure your own parser [0]
but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc).

I think the long-term plan was to work out a viable way of layering multiple parsers on top of each other, then change some of these to be "enhancing parsers" on top. However, that's still on the "TODO" list for Tika 2.0, as we've yet to come up with a good way to make it work within the SAX / ContentHandler structure.


I propose two solutions:

  1. augment the TikaConfig class so that a specific ContentHandler can be
  used in tika-config.xml;

That feels a bit wrong to me, because in almost all Tika use-cases, the value from the Config would be ignored.

Trying to explain to a new user which cases it'd be used in, and which ones it'd be ignored in, seems hard and confusing too...


  2. determine the ContentHandler to use for parsing through HTTP headers,
  for example:

We do allow setting of parser config via headers, so this would have precedent. It would also allow per-request changing.

Otherwise, if server-wide is OK (which your config idea would require anyway), might it not be better to make it an option when you start the server? I see it as being a bit more like picking a port, in terms of something specific to how you run that server instance.

e.g. java -jar tika-server.jar --port 1234 --content-handler PhoneExtractingContentHandler
e.g. java -jar tika-server.jar --port 1234 --content-handler com.example.CustomHandler
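
A rough sketch of how such a startup option might be resolved, purely as an illustration. The --content-handler flag is only the suggestion above, not an existing tika-server option, and the HandlerOption class and its methods are hypothetical names. It assumes a fully qualified class name with a public no-argument constructor, which real Tika handlers may not have, so it only uses the SAX types that ship with the JDK:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerOption {
    // Hypothetical flag name from the suggestion above; tika-server does not define it.
    static final String FLAG = "--content-handler";

    /** Returns the value following the flag, or null if the flag is absent. */
    static String handlerClassName(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (FLAG.equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /** Instantiates the requested handler, or a plain DefaultHandler if none was requested. */
    static ContentHandler resolve(String[] args) throws ReflectiveOperationException {
        String name = handlerClassName(args);
        if (name == null) {
            return new DefaultHandler();
        }
        // Assumption: the name is fully qualified and the class is on the classpath.
        Class<?> cls = Class.forName(name);
        if (!ContentHandler.class.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(name + " is not a ContentHandler");
        }
        // Assumption: a public no-arg constructor exists.
        return cls.asSubclass(ContentHandler.class).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        ContentHandler handler = resolve(args);
        System.out.println("Using handler: " + handler.getClass().getName());
    }
}
```

A handler that wraps another handler or needs a Metadata object would need a factory of some kind rather than plain reflection, which is part of why this is not a trivial change.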

Nick
