On Thu, 28 Sep 2017, Giuseppe Totaro wrote:
> If I am not wrong, currently you cannot configure a specific ContentHandler while using tika-server. I mean that you can configure your own parser [0], but you cannot control which ContentHandler the parser leverages to extract text and metadata (e.g., you cannot use PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc.).
I think the long-term plan was to work out a viable way of layering multiple parsers on top of each other, then change some of these to be "enhancing parsers" on top. However, that's still on the "TODO" list for Tika 2.0, as we've still yet to come up with a good way to allow it to happen within the SAX / ContentHandler structure.
> I propose two solutions:
> 1. augment the TikaConfig class so that a specific ContentHandler can be used in tika-config.xml;
That feels a bit wrong to me, because in almost all Tika use-cases, the value from the Config would be ignored.
Trying to explain to a new user in which cases it'd be used, and in which ones it'd be ignored, seems hard and confusing too...
> 2. determine the ContentHandler to use for parsing through HTTP headers, for example:
We do allow setting of parser config via headers, so this would have precedent. It would also allow per-request changing.
Otherwise, if server-wide is OK (which your config idea would require anyway), might it not be better to make it an option when you start the server? I see it as being a bit more like picking a port, in terms of something specific to how you run that server instance
eg
  java -jar tika-server.jar --port 1234 --content-handler PhoneExtractingContentHandler
eg
  java -jar tika-server.jar --port 1234 --content-handler com.example.CustomHandler

Nick
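
To make the proposed option a bit more concrete, here is a rough sketch of how a handler name taken from a startup option (or a header) might be resolved to a ContentHandler instance. Everything in it is an assumption for illustration: the HandlerResolver class, its qualify and resolve methods, and the idea of a default package for short names are hypothetical and not part of Tika, and the --content-handler option itself is only a suggestion in this thread. It uses only the JDK's own org.xml.sax classes, so it runs without Tika on the classpath.

```java
import org.xml.sax.ContentHandler;

// Hypothetical helper, not part of Tika: maps a handler name to an instance.
final class HandlerResolver {

    // A short name (no dot) is assumed to live in defaultPackage;
    // a name containing a dot is treated as fully qualified.
    static String qualify(String name, String defaultPackage) {
        return name.contains(".") ? name : defaultPackage + "." + name;
    }

    // Loads the class, checks that it really is a ContentHandler, and
    // instantiates it through a public no-argument constructor.
    static ContentHandler resolve(String name, String defaultPackage)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(qualify(name, defaultPackage));
        if (!ContentHandler.class.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(name + " is not a ContentHandler");
        }
        return (ContentHandler) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // DefaultHandler ships with the JDK and has a no-argument constructor.
        ContentHandler handler = resolve("DefaultHandler", "org.xml.sax.helpers");
        System.out.println(handler.getClass().getName());
    }
}
```

One caveat with this approach: handlers such as PhoneExtractingContentHandler wrap another handler and take constructor arguments, so a plain no-argument lookup like the one above would not be enough for them. A real implementation would likely need some kind of factory that knows how to build each handler.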
