This makes sense to me, +1 Giuseppe!


On 10/24/17, 6:12 PM, "Giuseppe Totaro" <[email protected]> wrote:

    Hi folks,
    
    I am developing the proposed solutions within tika-server for enabling
    specific ContentHandlers. Basically, I am working to provide the ability to
    give the name of the ContentHandler to be used via either a command-line
    option or an HTTP header.
    In order to complete my work, I would like to get your feedback about the
    following aspects:
    
       1. To create and use the given ContentHandler, should I modify each
       method within the TikaResource class (as well as the other classes
       within org.apache.tika.server.resource) where the parse method is
       invoked, wrapping the ContentHandler currently used? Alternatively, I
       could create a new method (therefore a new REST API) specifically focused
       on creating a ContentHandler from the list provided by the user. Of
       course, I am totally open to other solutions.
    
       2. As ContentHandlers often provide different types of constructors, we
       would need a mechanism to determine via reflection the constructor and
       the parameters to be used. I think we could get the ContentHandler class
       by using the static method Class.forName(String className) [0] with the
       fully-qualified name of the given class, and then use the method
       getConstructor(Class<?>... parameterTypes) [1] to determine the
       constructor to be used and instantiate the ContentHandler.
    
       3. If you agree with the above, I think that we can allow users to
       provide the parameters according to RFC 822 [3] so that they can give the
       name of the ContentHandler to be used and the parameters as a
       semicolon-separated list of entries:
    
       <header> = X-Content-Handler: <entry> *[, <entry>]
       <entry> = <content handler> *[; <param>]
       <param> = <java type> = <value>
    
       Consistently, I would enable the same syntax when using the command-line
       option:
    
       java -jar tika-server-X.jar -contentHandler <entry>*[,<entry>]
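
A minimal sketch of how points 2 and 3 could fit together, using only JDK classes. The class name ContentHandlerSpecs and its methods are hypothetical, not existing tika-server code. The sketch assumes the proposed "<content handler>; <java type>=<value>" syntax, supports only int, long, boolean and java.lang.String parameters, and does not handle values that themselves contain ',' or ';'.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ContentHandler;

// Hypothetical helper: parses the proposed X-Content-Handler value and
// instantiates each handler via Class.forName + getConstructor.
final class ContentHandlerSpecs {

    /** Instantiates every comma-separated <entry> of a header value, in order. */
    static List<ContentHandler> createAll(String headerValue)
            throws ReflectiveOperationException {
        List<ContentHandler> handlers = new ArrayList<>();
        for (String entry : headerValue.split(",")) {
            handlers.add(create(entry));
        }
        return handlers;
    }

    /** Instantiates one <entry> = <content handler> *[; <java type>=<value>]. */
    static ContentHandler create(String entry) throws ReflectiveOperationException {
        String[] parts = entry.trim().split(";");
        String className = parts[0].trim();
        Class<?>[] types = new Class<?>[parts.length - 1];
        Object[] args = new Object[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException(
                        "Expected <java type>=<value> but got: " + parts[i]);
            }
            types[i - 1] = resolveType(parts[i].substring(0, eq).trim());
            args[i - 1] = convert(types[i - 1], parts[i].substring(eq + 1).trim());
        }
        Class<?> cls = Class.forName(className);
        // Refuse anything that is not a SAX ContentHandler before instantiating it.
        if (!ContentHandler.class.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(className + " is not a ContentHandler");
        }
        Constructor<?> ctor = cls.getConstructor(types);
        return (ContentHandler) ctor.newInstance(args);
    }

    /** Class.forName cannot resolve primitive names, so map them explicitly. */
    static Class<?> resolveType(String name) throws ClassNotFoundException {
        switch (name) {
            case "int": return int.class;
            case "long": return long.class;
            case "boolean": return boolean.class;
            default: return Class.forName(name);
        }
    }

    /** Converts the raw string value to the declared parameter type. */
    static Object convert(Class<?> type, String value) {
        if (type == int.class) return Integer.parseInt(value);
        if (type == long.class) return Long.parseLong(value);
        if (type == boolean.class) return Boolean.parseBoolean(value);
        if (type == String.class) return value;
        throw new IllegalArgumentException(
                "Unsupported parameter type: " + type.getName());
    }
}
```

For instance, ContentHandlerSpecs.create("org.xml.sax.helpers.DefaultHandler") returns a DefaultHandler built through its no-argument constructor; a Tika handler would be named the same way by its fully-qualified class name, provided it is on the classpath. The sketch places no restriction on which classes may be loaded beyond the ContentHandler check, so a real implementation would probably want a default package or an allow-list.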
    
    I look forward to having your feedback.
    
    Thanks a lot,
    Giuseppe
    
    [0] https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
    [1] https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
    [3] https://www.w3.org/Protocols/rfc822/
    
    On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin <[email protected]>
    wrote:
    
    > Konstantin, by the way, if you are interested in having a good discussion
    > to do with using the serialized lambdas then you will be welcome to
    > comment on the relevant text in the Tika Concerns Beam thread, though maybe
    > Beam knows how to take care of the issues you raised...
    >
    > Thanks, Sergey
    >
    > On 06/10/17 18:27, Sergey Beryozkin wrote:
    >
    >> On 06/10/17 18:08, Konstantin Gribov wrote:
    >>
    >>> My +1 to this idea.
    >>>
    >>> IMHO, the second option is more flexible. I also like Nick's suggestion
    >>> about using a default package for handlers and interpreting a
    >>> dot-separated string as an fqcn. Solr does a similar thing and it's very
    >>> convenient to use (but they use the prefix `solr.` for their classes in a
    >>> predefined package and any other name is interpreted as an fqcn).
    >>>
    >>> I'll add that you could allow the user to pass several comma-separated
    >>> handlers, to allow building a content-handler stack if the user wants to.
    >>>
    >>> I would disagree with Sergey about serialized lambdas for 2 reasons:
    >>> - it's useful only for java-clients;
    >>> - it could bring very nasty bugs leading to RCE class vulnerabilities,
    >>> so it's very controversial from security PoV.
    >>>
    >> Sure. I was not actually suggesting to use them in Tika natively, I only
    >> referred to it as the alternative mentioned in the context of the Beam
    >> integration work
    >>
    >> Sergey
    >>
    >>>
    >>> On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <[email protected]>
    >>> wrote:
    >>>
    >>>> Hi folks,
    >>>>
    >>>> if I am not wrong, currently you cannot configure a specific
    >>>> ContentHandler while using tika-server. I mean that you can configure
    >>>> your own parser [0] but you cannot control which ContentHandler the
    >>>> parser leverages to extract text and metadata (e.g., you cannot use
    >>>> PhoneExtractingContentHandler, StandardsExtractingContentHandler, etc).
    >>>> If it is correct, it would be nice to enable the use of specific
    >>>> ContentHandlers within tika-server and I would like to discuss how to
    >>>> solve this issue generally.
    >>>>
    >>>> I propose two solutions:
    >>>>
    >>>>     1. augment the TikaConfig class so that a specific ContentHandler
    >>>>     can be used in tika-config.xml;
    >>>>     2. determine the ContentHandler to use for parsing through HTTP
    >>>>     headers, for example:
    >>>>     curl -T filename.pdf http://localhost:9998/meta --header
    >>>>     "X-Content-Handler: PhoneExtractingContentHandler"
    >>>>     This should affect also the TikaResource.java class.
    >>>>
    >>>> I look forward to having your feedback. I strongly believe that every
    >>>> user who wants to use Tika as a service through tika-server and needs
    >>>> to extract content and metadata like phone numbers, standard
    >>>> references, etc would be very happy.
    >>>>
    >>>> Thanks a lot,
    >>>> Giuseppe
    >>>>
    >>>>
    >
    > --
    > Sergey Beryozkin
    >
    > Talend Community Coders
    > http://coders.talend.com/
    >
    

