+1 to this idea. IMHO, the second option is more flexible. I also like Nick's suggestion of using a default package for handlers and interpreting a dot-separated string as an FQCN. Solr does something similar and it's very convenient to use (though they use the prefix `solr.` for classes in their predefined packages and interpret anything else as an FQCN).
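A rough sketch of that name resolution, just to illustrate the idea. The class name `HandlerNames`, the method `resolve`, and the choice of `org.apache.tika.sax` as the default package are my assumptions here, not existing tika-server code, and the sketch only maps names to class names without loading anything:

```java
import java.util.Objects;

public final class HandlerNames {
    // Assumed default package for short handler names (hypothetical choice).
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    private HandlerNames() {
    }

    // A name without a dot is resolved against the default package;
    // a name containing a dot is treated as a fully qualified class name.
    static String resolve(String name) {
        String trimmed = Objects.requireNonNull(name, "name").trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty handler name");
        }
        return trimmed.indexOf('.') >= 0
                ? trimmed
                : DEFAULT_PACKAGE + "." + trimmed;
    }

    public static void main(String[] args) {
        // Short name: prefixed with the assumed default package.
        System.out.println(resolve("PhoneExtractingContentHandler"));
        // Dotted name (made up for the example): returned unchanged.
        System.out.println(resolve("com.example.MyHandler"));
    }
}
```

With something like this, a header value such as `PhoneExtractingContentHandler` stays short for the built-in handlers, while users with their own handlers can still pass the full class name.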
I'd add that you could let the user pass several comma-separated handlers, so that they can build a content-handler stack if they want to.

I disagree with Sergey about serialized lambdas, for two reasons:
- they are useful only for Java clients;
- they could introduce very nasty bugs leading to RCE-class vulnerabilities, so the idea is very controversial from a security PoV.

On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <totarope...@gmail.com> wrote:

> Hi folks,
>
> if I am not wrong, currently you cannot configure a specific ContentHandler
> while using tika-server. I mean that you can configure your own parser [0]
> but you cannot control which ContentHandler the parser leverages to extract
> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
> StandardsExtractingContentHandler, etc).
> If it is correct, it would be nice to enable the use of specific
> ContentHandlers within tika-server and I would like to discuss how to solve
> this issue generally.
>
> I propose two solutions:
>
> 1. augment the TikaConfig class so that a specific ContentHandler can be
>    used in tika-config.xml;
> 2. determine the ContentHandler to use for parsing through HTTP headers,
>    for example:
>    curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"
>    This should affect also the TikaResource.java class.
>
> I look forward to having your feedback. I strongly believe that every user
> who wants to use Tika as a service through tika-server and needs to extract
> content and metadata like phone numbers, standard references, etc would be
> very happy.
>
> Thanks a lot,
> Giuseppe
>

--
Best regards,
Konstantin Gribov