On 06/10/17 18:08, Konstantin Gribov wrote:
My +1 to this idea.

IMHO, the second option is more flexible. I also like Nick's suggestion about
using a default package for handlers and interpreting a dot-separated string as
an FQCN. Solr does a similar thing and it's very convenient to use (they use
the prefix `solr.` for their classes in a predefined package, and anything else
is interpreted as an FQCN).

I'll add that you could let the user pass several comma-separated handlers
to build a content-handler stack if they want to.
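
A minimal sketch of the name resolution described above, not an existing Tika API. It assumes `org.apache.tika.sax` as the default package (the thread does not fix one), and `HandlerNameResolver` and `com.example.MyHandler` are hypothetical names used only for illustration. A name without a dot is looked up in the default package, a dotted name is taken as an FQCN, and a comma-separated value yields an ordered list of class names:

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerNameResolver {
    // Assumption: the default package for short handler names.
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    // A name containing a dot is treated as an FQCN; anything else is
    // prefixed with the default package.
    static String resolve(String name) {
        String trimmed = name.trim();
        return trimmed.contains(".") ? trimmed : DEFAULT_PACKAGE + "." + trimmed;
    }

    // Splits a comma-separated value into resolved class names, keeping
    // the order given by the user and skipping empty entries.
    static List<String> resolveAll(String value) {
        List<String> resolved = new ArrayList<>();
        for (String part : value.split(",")) {
            if (!part.trim().isEmpty()) {
                resolved.add(resolve(part));
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // com.example.MyHandler is a hypothetical user-supplied class.
        System.out.println(
            resolveAll("PhoneExtractingContentHandler, com.example.MyHandler"));
    }
}
```

How the resolved handlers are then chained into a stack is left open here, since the thread does not specify it.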

I would disagree with Sergey about serialized lambdas for 2 reasons:
- it's useful only for Java clients;
- it could bring very nasty bugs leading to RCE-class vulnerabilities, so
it's very controversial from a security PoV.
Sure. I was not actually suggesting using them in Tika natively; I only
referred to them as the alternative mentioned in the context of the Beam
integration work.

Sergey

On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <totarope...@gmail.com>
wrote:

Hi folks,

if I am not wrong, currently you cannot configure a specific ContentHandler
while using tika-server. I mean that you can configure your own parser [0]
but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc.).
If that is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server and I would like to discuss how to solve
this issue generally.

I propose two solutions:

    1. augment the TikaConfig class so that a specific ContentHandler can be
    used in tika-config.xml;
    2. determine the ContentHandler to use for parsing through HTTP headers,
    for example:
    curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"
    This should also affect the TikaResource.java class.
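
A rough sketch of how option 2 might look on the server side, under several assumptions: `HeaderHandlerFactory` is a hypothetical helper (not part of tika-server), the request headers are modeled as a plain `Map`, the header carries a fully qualified class name, and the handler class has a public no-argument constructor. Handlers whose constructors take arguments would need extra wiring that is not shown here.

```java
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

public class HeaderHandlerFactory {
    // Header name taken from the curl example in this proposal.
    static final String HEADER = "X-Content-Handler";

    // Returns a handler for the class named in the header, or a plain
    // DefaultHandler when the header is absent or blank.
    static ContentHandler fromHeaders(Map<String, String> headers) {
        String className = headers.get(HEADER);
        if (className == null || className.trim().isEmpty()) {
            return new DefaultHandler();
        }
        try {
            Class<?> type = Class.forName(className.trim());
            // Reject classes that are not SAX content handlers.
            if (!ContentHandler.class.isAssignableFrom(type)) {
                throw new IllegalArgumentException(
                    className + " is not a ContentHandler");
            }
            // Assumption: a public no-argument constructor exists.
            return (ContentHandler) type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot create handler " + className, e);
        }
    }

    public static void main(String[] args) {
        ContentHandler handler = fromHeaders(
            Map.of(HEADER, "org.xml.sax.helpers.DefaultHandler"));
        System.out.println(handler.getClass().getName());
    }
}
```

Since the class name comes from the client, a real implementation would probably also restrict which classes may be loaded.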

I look forward to your feedback. I strongly believe that every user who
wants to use Tika as a service through tika-server and needs to extract
content and metadata like phone numbers, standard references, etc., would be
very happy with this feature.

Thanks a lot,
Giuseppe
