Konstantin, by the way, if you are interested in discussing the use of
serialized lambdas, you are welcome to comment on the relevant text in
the Tika Concerns Beam thread, though maybe Beam knows how to take care
of the issues you raised...
Thanks, Sergey
On 06/10/17 18:27, Sergey Beryozkin wrote:
On 06/10/17 18:08, Konstantin Gribov wrote:
My +1 to this idea.
IMHO, the second option is more flexible. I also like Nick's suggestion
about using a default package for handlers and interpreting a
dot-separated string as an FQCN. Solr does a similar thing and it's very
convenient to use (but they use the prefix `solr.` for their classes in
a predefined package, and anything else is interpreted as an FQCN).
I'll add that you could allow the user to pass several comma-separated
handlers, so that a content-handler stack can be built if the user wants
one.
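The naming rule described above could be sketched roughly as follows.
This is only an illustration: the HandlerNames class and its methods are
hypothetical, and org.apache.tika.sax is assumed as the default package
because nothing in this thread fixes one. It only maps names to class
names; loading and stacking the handlers is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerNames {
    // Assumption: short names live in this package; the thread does not decide this.
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    /** A name containing a dot is taken as an FQCN; anything else gets the default package. */
    static String resolve(String name) {
        String trimmed = name.trim();
        return trimmed.contains(".") ? trimmed : DEFAULT_PACKAGE + "." + trimmed;
    }

    /** Splits a comma-separated value into class names, keeping the order given by the user. */
    static List<String> resolveAll(String value) {
        List<String> result = new ArrayList<>();
        for (String part : value.split(",")) {
            if (!part.trim().isEmpty()) {   // skip empty entries such as "a,,b"
                result.add(resolve(part));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(resolveAll("PhoneExtractingContentHandler, com.example.MyHandler"));
    }
}
```

A Solr-style prefix would be a small variation: strip the prefix and
prepend the predefined package instead of testing for a dot.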
I would disagree with Sergey about serialized lambdas for two reasons:
- they are useful only for Java clients;
- they could bring very nasty bugs leading to RCE-class vulnerabilities,
so they are very controversial from a security PoV.
Sure. I was not actually suggesting using them in Tika natively; I only
referred to them as an alternative mentioned in the context of the Beam
integration work
Sergey
On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro wrote:
Hi folks,
if I am not wrong, currently you cannot configure a specific
ContentHandler while using tika-server. I mean that you can configure
your own parser [0] but you cannot control which ContentHandler the
parser leverages to extract text and metadata (e.g., you cannot use
PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc.).
If that is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server, and I would like to discuss how to
solve this issue generally.
I propose two solutions:
1. augment the TikaConfig class so that a specific ContentHandler can be
used in tika-config.xml;
2. determine the ContentHandler to use for parsing through HTTP headers,
for example:
curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"
This should also affect the TikaResource.java class.
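For Java clients, the same request could look roughly like the sketch
below, using the JDK 11+ java.net.http API. The X-Content-Handler header
is only the proposal made here, not something tika-server is known to
support, and the ContentHandlerRequest class is hypothetical. The
request is built but not sent, and a byte array stands in for the file.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class ContentHandlerRequest {
    /** Builds (without sending) a PUT to /meta that carries the proposed header. */
    static HttpRequest build(byte[] body, String handlerName) {
        return HttpRequest.newBuilder(URI.create("http://localhost:9998/meta"))
                .header("X-Content-Handler", handlerName)   // proposed header, an assumption
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        byte[] body = "call 555-0100".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = build(body, "PhoneExtractingContentHandler");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("X-Content-Handler").orElse("(none)"));
    }
}
```

PUT is used because that is what curl -T does for an HTTP URL.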
I look forward to your feedback. I strongly believe that every user who
wants to use Tika as a service through tika-server and needs to extract
content and metadata like phone numbers, standard references, etc. would
be very happy.
Thanks a lot,
Giuseppe
--
Sergey Beryozkin
Talend Community Coders
http://coders.talend.com/