Hi folks,

I am implementing the proposed solution for enabling specific
ContentHandlers within tika-server. Basically, I am working on letting users
specify the name of the ContentHandler to use, via either a command-line
option or an HTTP header.
In order to complete my work, I would like to get your feedback about the
following aspects:

   1. To create and use the given ContentHandler, should I modify each
   method within the TikaResource class (as well as the other classes
   within org.apache.tika.server.resource) where the parse method is
   called, so that it wraps the ContentHandler currently used? Alternatively,
   I could create a new method (and therefore a new REST API) specifically
   focused on creating a ContentHandler from the list provided by the user.
   Of course, I am totally open to other solutions.

   2. As ContentHandlers often provide several constructors, we would need
   a mechanism to determine via reflection which constructor and parameters
   to use. I think we could load the ContentHandler class by calling the
   static method Class.forName(String className) [0] with the
   fully-qualified name of the given class, and then call
   getConstructor(Class<?>... parameterTypes) [1] to select the constructor
   and instantiate the ContentHandler.

   3. If you agree with the above, I think that we can allow users to
   provide the parameters according to RFC 822 [3], so that they can give
   the name of the ContentHandler to be used and its parameters as a
   semicolon-separated list of entries:

   <header> = X-Content-Handler: <entry> *[, <entry>]
   <entry> = <content handler> *[; <param>]
   <param> = <java type> = <value>

   Consistently, I would enable the same syntax when using the command-line
   option:

   java -jar tika-server-X.jar -contentHandler <entry>*[,<entry>]
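
To make the syntax in point 3 more concrete, here is a rough sketch of how
such a value could be parsed. The names ContentHandlerHeaderParser, Entry
and Param are placeholders of mine, not existing tika-server code, and the
sketch assumes that values never contain commas or semicolons:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tika-server: parses the proposed
// X-Content-Handler value into handler names and typed parameters.
final class ContentHandlerHeaderParser {

    /** One <param>: a Java type name and its (still unparsed) value. */
    static final class Param {
        final String javaType;
        final String value;

        Param(String javaType, String value) {
            this.javaType = javaType;
            this.value = value;
        }
    }

    /** One <entry>: a handler name followed by zero or more params. */
    static final class Entry {
        final String handlerName;
        final List<Param> params = new ArrayList<>();

        Entry(String handlerName) {
            this.handlerName = handlerName;
        }
    }

    /** Splits on ',' into entries, on ';' into params, on the first '='. */
    static List<Entry> parse(String headerValue) {
        List<Entry> entries = new ArrayList<>();
        for (String rawEntry : headerValue.split(",")) {
            String[] parts = rawEntry.trim().split(";");
            String name = parts[0].trim();
            if (name.isEmpty()) {
                throw new IllegalArgumentException(
                        "Missing handler name in: '" + rawEntry + "'");
            }
            Entry entry = new Entry(name);
            for (int i = 1; i < parts.length; i++) {
                String[] typeAndValue = parts[i].split("=", 2);
                if (typeAndValue.length != 2 || typeAndValue[0].trim().isEmpty()) {
                    throw new IllegalArgumentException(
                            "Expected <java type>=<value> but got: '" + parts[i] + "'");
                }
                entry.params.add(
                        new Param(typeAndValue[0].trim(), typeAndValue[1].trim()));
            }
            entries.add(entry);
        }
        return entries;
    }
}
```

The same parse method could serve both the HTTP header and the
-contentHandler command-line option, since the two share one syntax.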

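The reflection approach in point 2 could look roughly like the sketch below.
It only uses JDK classes so that it runs on its own; ContentHandlerFactory,
resolveType and convert are placeholder names of mine, and the set of
supported parameter types (int, boolean, java.lang.String) is an arbitrary
choice for illustration:

```java
import java.lang.reflect.Constructor;
import org.xml.sax.ContentHandler;

// Hypothetical helper, not part of tika-server: instantiates a
// ContentHandler by class name through a constructor chosen by reflection.
final class ContentHandlerFactory {

    /** Maps a type name to a Class; primitives need special handling. */
    static Class<?> resolveType(String typeName) throws ClassNotFoundException {
        switch (typeName) {
            case "int":
                return int.class;
            case "boolean":
                return boolean.class;
            default:
                return Class.forName(typeName);
        }
    }

    /** Converts a string value to the given type (few types supported). */
    static Object convert(Class<?> type, String value) {
        if (type == int.class) {
            return Integer.valueOf(value);
        }
        if (type == boolean.class) {
            return Boolean.valueOf(value);
        }
        if (type == String.class) {
            return value;
        }
        throw new IllegalArgumentException(
                "Unsupported parameter type: " + type.getName());
    }

    static ContentHandler create(String className, String[] typeNames, String[] values)
            throws ReflectiveOperationException {
        Class<?>[] types = new Class<?>[typeNames.length];
        Object[] args = new Object[typeNames.length];
        for (int i = 0; i < typeNames.length; i++) {
            types[i] = resolveType(typeNames[i]);
            args[i] = convert(types[i], values[i]);
        }
        // asSubclass throws ClassCastException for non-ContentHandler classes.
        Class<? extends ContentHandler> handlerClass =
                Class.forName(className).asSubclass(ContentHandler.class);
        // getConstructor only finds public constructors with exactly these types.
        Constructor<? extends ContentHandler> constructor =
                handlerClass.getConstructor(types);
        return constructor.newInstance(args);
    }
}
```

For example, create("org.xml.sax.helpers.DefaultHandler", new String[0],
new String[0]) returns the JDK's no-op handler. Handlers whose constructors
take another ContentHandler or a Metadata object would need those arguments
supplied by the server rather than by the user, which this sketch does not
cover.
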
I look forward to having your feedback.

Thanks a lot,
Giuseppe

[0]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1]
https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/

On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin <[email protected]>
wrote:

> Konstantin, by the way, if you are interested in a good discussion about
> using serialized lambdas, you are welcome to comment on the relevant text
> in the Tika Concerns Beam thread, though maybe Beam knows how to take care
> of the issues you raised...
>
> Thanks, Sergey
>
> On 06/10/17 18:27, Sergey Beryozkin wrote:
>
>> On 06/10/17 18:08, Konstantin Gribov wrote:
>>
>>> My +1 to this idea.
>>>
>>> IMHO, the second option is more flexible. I also like Nick's suggestion
>>> about using a default package for handlers and interpreting a
>>> dot-separated string as an FQCN. Solr does a similar thing and it's very
>>> convenient to use (but they use the prefix `solr.` for their classes in a
>>> predefined package, and any other name is interpreted as an FQCN).
>>>
>>> I'll add that you could allow the user to pass several comma-separated
>>> handlers to build a content-handler stack if they want to.
>>>
>>> I would disagree with Sergey about serialized lambdas for 2 reasons:
>>> - it's useful only for java-clients;
>>> - it could bring very nasty bugs leading to RCE class vulnerabilities, so
>>> it's very controversial from security PoV.
>>>
>> Sure. I was not actually suggesting to use them in Tika natively, I only
>> referred to it as the alternative mentioned in the context of the Beam
>> integration work
>>
>> Sergey
>>
>>>
>>> On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <[email protected]>
>>> wrote:
>>>
>>> Hi folks,
>>>>
>>>> if I am not wrong, currently you cannot configure a specific
>>>> ContentHandler
>>>> while using tika-server. I mean that you can configure your own parser
>>>> [0]
>>>> but you cannot control which ContentHandler the parser leverages to
>>>> extract
>>>> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
>>>> StandardsExtractingContentHandler, etc).
>>>> If it is correct, it would be nice to enable the use of specific
>>>> ContentHandlers within tika-server and I would like to discuss how to
>>>> solve
>>>> this issue generally.
>>>>
>>>> I propose two solutions:
>>>>
>>>>     1. augment the TikaConfig class so that a specific ContentHandler
>>>> can be
>>>>     used in tika-config.xml;
>>>>     2. determine the ContentHandler to use for parsing through HTTP
>>>> headers,
>>>>     for example:
>>>>     curl -T filename.pdf http://localhost:9998/meta --header
>>>>     "X-Content-Handler: PhoneExtractingContentHandler"
>>>>     This should also affect the TikaResource.java class.
>>>>
>>>> I look forward to having your feedback. I strongly believe that every
>>>> user
>>>> who wants to use Tika as a service through tika-server and needs to
>>>> extract
>>>> content and metadata like phone numbers, standard references, etc would
>>>> be
>>>> very happy.
>>>>
>>>> Thanks a lot,
>>>> Giuseppe
>>>>
>>>>
>
> --
> Sergey Beryozkin
>
> Talend Community Coders
> http://coders.talend.com/
>
