Hi Chris
Another option (for Beam) was passing a custom content handler via a
serialized lambda expression - which sounds like black magic to me at
the moment, but I'm curious :-)
I thought, assuming TikaConfig is only used once to bootstrap, then
passing a ContentHandler class name might work. You are right, the
headers are better suited for the dynamic cases...
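
For illustration, the class-name idea could look roughly like the sketch
below. It is only a sketch: HandlerByClassName and newHandler are
hypothetical names, not Tika API, and it assumes the handler class has a
public no-argument constructor, which may not hold for handlers that wrap
another handler. The JDK's DefaultHandler stands in for a real one.

```java
import org.xml.sax.ContentHandler;

// Hypothetical helper, not part of Tika: builds a SAX ContentHandler from a class name.
class HandlerByClassName {

    // Assumption: the named class implements ContentHandler and has a no-arg constructor.
    static ContentHandler newHandler(String className) throws ReflectiveOperationException {
        // asSubclass throws ClassCastException if the class is not a ContentHandler.
        Class<? extends ContentHandler> type =
                Class.forName(className).asSubclass(ContentHandler.class);
        return type.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // The JDK's DefaultHandler is used here only as a stand-in handler.
        ContentHandler handler = newHandler("org.xml.sax.helpers.DefaultHandler");
        System.out.println(handler.getClass().getName());
    }
}
```

Since only the class name (a plain String) has to travel with the
configuration, nothing beyond the name needs to be serialized.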
Cheers, Sergey
On 28/09/17 22:35, Chris Mattmann wrote:
Hmm, cool.
Can we support both? If I don’t have to modify/ship a Tika config (which is a
runtime configuration) and I can, on a per-call basis, change the ContentHandler,
it would be MUCH easier in downstream libraries like Tika Python that rely on the
REST server.
These are documented here:
https://wiki.apache.org/tika/API%20Bindings%20for%20Tika
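
A rough sketch of the per-call selection, under stated assumptions:
HandlerRegistry, register and resolve are hypothetical names, not
tika-server API, and the header value is assumed to be a short registered
name rather than an arbitrary class name. The JDK's DefaultHandler stands
in for real handlers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical registry, not part of tika-server: maps a header value to a handler factory.
class HandlerRegistry {
    private final Map<String, Supplier<ContentHandler>> factories = new ConcurrentHashMap<>();

    // Registers a factory under the name a client would send in the header.
    void register(String name, Supplier<ContentHandler> factory) {
        factories.put(name, factory);
    }

    // headerValue is the raw value of the proposed X-Content-Handler header (may be null).
    ContentHandler resolve(String headerValue) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            // No header sent: fall back to a default, so existing clients keep working.
            return new DefaultHandler();
        }
        Supplier<ContentHandler> factory = factories.get(headerValue.trim());
        if (factory == null) {
            throw new IllegalArgumentException("Unknown content handler: " + headerValue);
        }
        // A fresh handler per call, since handlers typically hold per-document state.
        return factory.get();
    }

    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        registry.register("DefaultHandler", DefaultHandler::new);
        System.out.println(registry.resolve("DefaultHandler").getClass().getSimpleName());
        System.out.println(registry.resolve(null).getClass().getSimpleName());
    }
}
```

Falling back to a default when the header is absent is what would keep
existing REST clients unaffected.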
Cheers,
Chris
On 9/28/17, 2:26 PM, "Sergey Beryozkin" <[email protected]> wrote:
Hi
Option #1 is also good - the question of how to pass a ContentHandler to a
Beam function was still open, and given that passing TikaConfig is needed
anyway, having a way to specify a handler there can be handy too...
Cheers, Sergey
On 28/09/17 22:17, Chris Mattmann wrote:
> I am +1 for this. Option #2 sounds like a slick way to handle this for me that
> would remain back compat with tika-python, which is of strong interest to me.
>
> Cheers,
> Chris
>
>
>
>
> On 9/28/17, 1:35 PM, "Giuseppe Totaro" <[email protected]> wrote:
>
> Hi folks,
>
> If I am not wrong, you currently cannot configure a specific ContentHandler
> while using tika-server. I mean that you can configure your own parser [0]
> but you cannot control which ContentHandler the parser leverages to extract
> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
> StandardsExtractingContentHandler, etc.).
> If this is correct, it would be nice to enable the use of specific
> ContentHandlers within tika-server, and I would like to discuss how to solve
> this issue generally.
>
> I propose two solutions:
>
> 1. augment the TikaConfig class so that a specific ContentHandler can be
>    used in tika-config.xml;
> 2. determine the ContentHandler to use for parsing through HTTP headers,
>    for example:
>    curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"
>    This should also affect the TikaResource.java class.
>
> I look forward to your feedback. I strongly believe that every user
> who wants to use Tika as a service through tika-server and needs to extract
> content and metadata like phone numbers, standard references, etc. would be
> very happy.
>
> Thanks a lot,
> Giuseppe
>
>
>