Konstantin, by the way, if you are interested in discussing the use of serialized lambdas, you are welcome to comment on the relevant text in the Tika Concerns Beam thread, though maybe Beam already knows how to take care of the issues you raised...

Thanks, Sergey
On 06/10/17 18:27, Sergey Beryozkin wrote:
On 06/10/17 18:08, Konstantin Gribov wrote:
My +1 to this idea.

IMHO, the second option is more flexible. I also like Nick's suggestion about
using a default package for handlers and interpreting a dot-separated string
as an FQCN. Solr does a similar thing and it's very convenient to use (but they
use the prefix `solr.` for their classes in a predefined package, and any other
name is interpreted as an FQCN).

I'll add that you could allow the user to pass several comma-separated handlers
to build a content-handler stack if they want to.
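
To make the naming convention above concrete, here is a minimal sketch of how such a header value could be resolved to class names. It is only an illustration, not Tika code: the class `HandlerNames`, its methods, and the choice of `org.apache.tika.sax` as the default package are assumptions made for this example, and `com.example.MyHandler` is a made-up name. The sketch only resolves names; it does not load or instantiate any class.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerNames {
    // Assumption: short names are looked up in this package.
    // The thread does not fix which package would be the default.
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    // A name containing a dot is treated as an FQCN;
    // anything else is prefixed with the default package.
    public static String resolve(String name) {
        String trimmed = name.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty handler name");
        }
        return trimmed.contains(".") ? trimmed : DEFAULT_PACKAGE + "." + trimmed;
    }

    // Splits a comma-separated value into an ordered list of class names,
    // skipping empty entries such as those left by a trailing comma.
    public static List<String> resolveAll(String headerValue) {
        List<String> resolved = new ArrayList<>();
        for (String part : headerValue.split(",")) {
            if (!part.trim().isEmpty()) {
                resolved.add(resolve(part));
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Prints:
        // [org.apache.tika.sax.PhoneExtractingContentHandler, com.example.MyHandler]
        System.out.println(
            resolveAll("PhoneExtractingContentHandler, com.example.MyHandler"));
    }
}
```

The order of the returned list follows the order in the header, which is what a handler stack would presumably need.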

I would disagree with Sergey about serialized lambdas for 2 reasons:
- they're useful only for Java clients;
- they could bring very nasty bugs leading to RCE-class vulnerabilities, so
they're very controversial from a security PoV.
Sure. I was not actually suggesting using them in Tika natively; I only referred to them as the alternative mentioned in the context of the Beam integration work.

Sergey

On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <[email protected]>
wrote:

Hi folks,

if I am not wrong, currently you cannot configure a specific ContentHandler while using tika-server. I mean that you can configure your own parser [0] but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc).
If that is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server, and I would like to discuss how to solve
this issue in general.

I propose two solutions:

    1. augment the TikaConfig class so that a specific ContentHandler can be
    used in tika-config.xml;
    2. determine the ContentHandler to use for parsing through HTTP headers,
    for example:
    curl -T filename.pdf http://localhost:9998/meta \
         --header "X-Content-Handler: PhoneExtractingContentHandler"
    This would also affect the TikaResource.java class.

I look forward to your feedback. I strongly believe that every user who wants to use Tika as a service through tika-server and needs to extract content and metadata like phone numbers, standard references, etc. would be
very happy with this.

Thanks a lot,
Giuseppe



--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/
