Hmm, cool. Can we support both? If I don’t have to modify or ship a Tika config (which is runtime configuration) and can instead change the ContentHandler on a per-call basis, it would be MUCH easier for downstream libraries like Tika Python that rely on the REST server. These libraries are documented here:
https://wiki.apache.org/tika/API%20Bindings%20for%20Tika

Cheers,
Chris

On 9/28/17, 2:26 PM, "Sergey Beryozkin" <[email protected]> wrote:

    Hi

    Option #1 is also good - a question how to pass a ContentHandler to a
    Beam function was open, and given that passing TikaConfig is needed
    anyway, having a way to specify a handler there can be handy too...

    Cheers, Sergey

    On 28/09/17 22:17, Chris Mattmann wrote:
    > I am +1 for this. Option #2 sounds like a slick way to handle this for me that would
    > remain back compat with tika-python which is of strong interest to me.
    >
    > Cheers,
    > Chris
    >
    > On 9/28/17, 1:35 PM, "Giuseppe Totaro" <[email protected]> wrote:
    >
    > Hi folks,
    >
    > if I am not wrong, currently you cannot configure a specific ContentHandler
    > while using tika-server. I mean that you can configure your own parser [0]
    > but you cannot control which ContentHandler the parser leverages to extract
    > text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
    > StandardsExtractingContentHandler, etc).
    > If it is correct, it would be nice to enable the use of specific
    > ContentHandlers within tika-server and I would like to discuss how to solve
    > this issue generally.
    >
    > I propose two solutions:
    >
    > 1. augment the TikaConfig class so that a specific ContentHandler can be
    > used in tika-config.xml;
    > 2. determine the ContentHandler to use for parsing through HTTP headers,
    > for example:
    > curl -T filename.pdf http://localhost:9998/meta --header
    > "X-Content-Handler: PhoneExtractingContentHandler"
    > This should affect also the TikaResource.java class.
    >
    > I look forward to having your feedback. I strongly believe that every user
    > who wants to use Tika as a service through tika-server and needs to extract
    > content and metadata like phone numbers, standard references, etc would be
    > very happy.
    >
    > Thanks a lot,
    > Giuseppe
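
For illustration only, here is a rough sketch of how a downstream client might use option 2 on a per-call basis, mirroring the curl example in Giuseppe's message. The X-Content-Handler header is just the name proposed in this thread, and nothing here assumes tika-server actually honors it. The helper name build_meta_request and the default base URL are made up for the example.

```python
import urllib.request

# Assumption: header name taken from the proposal in this thread;
# it is a suggested convention, not a known tika-server feature.
HANDLER_HEADER = "X-Content-Handler"


def build_meta_request(data: bytes, handler: str,
                       base_url: str = "http://localhost:9998") -> urllib.request.Request:
    """Build (without sending) a PUT to /meta that names a ContentHandler.

    Equivalent in spirit to:
      curl -T filename.pdf http://localhost:9998/meta \
           --header "X-Content-Handler: <handler>"
    """
    return urllib.request.Request(
        url=f"{base_url}/meta",
        data=data,
        method="PUT",  # curl -T uploads with PUT
        headers={HANDLER_HEADER: handler},
    )
```

Sending it would be a matter of passing the result to urllib.request.urlopen against a running server. The point of the sketch is that the handler choice travels with each request, so a client library would not need to ship or modify a tika-config.xml.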
