Hi folks,

I am developing the proposed solutions in tika-server for enabling specific ContentHandlers. Basically, I am working on the ability to specify the name of the ContentHandler to use through either a command-line option or an HTTP header. To complete my work, I would like your feedback on the following aspects:

1. To create and use the given ContentHandler, should I modify each method in the TikaResource class (as well as the other classes in org.apache.tika.server.resource) that calls the parse method, so that it wraps the ContentHandler currently in use? Alternatively, I could create a new method (and therefore a new REST API) dedicated to creating a ContentHandler from the list provided by the user. Of course, I am totally open to other solutions.

2. As ContentHandlers often provide several constructors, we would need a mechanism to determine via reflection which constructor and parameters to use. I think we could load the class with the static method Class.forName(String className) [0], passing the fully-qualified name of the given class, and then use getConstructor(Class<?>... parameterTypes) [1] to select the constructor and instantiate the ContentHandler.

3. If you agree with the above, I think we can allow users to provide the parameters in the style of RFC 822 [3], so that they can give the name of the ContentHandler to be used and its parameters as a semicolon-separated list of entries:

   <header> = X-Content-Handler: <entry> *[, <entry>]
   <entry>  = <content handler> *[; <param>]
   <param>  = <java type> = <value>

   Consistently, I would enable the same syntax for the command-line option:

   java -jar tika-server-X.jar -contentHandler <entry>*[,<entry>]

I look forward to having your feedback.

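To make points 2 and 3 more concrete, here is a minimal sketch of how an entry could be parsed and turned into a ContentHandler via reflection. It is only an illustration under some assumptions: the class name ContentHandlerSpec, its methods, and the LabelledHandler test class are hypothetical and not part of Tika; only a handful of parameter types are supported; and values containing ',' or ';' are not handled.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not an existing Tika class.
final class ContentHandlerSpec {

    /** Hypothetical handler that exists only to exercise the sketch. */
    public static class LabelledHandler extends DefaultHandler {
        public final String label;
        public final int limit;

        public LabelledHandler(String label, int limit) {
            this.label = label;
            this.limit = limit;
        }
    }

    /** Splits a header value such as "A; int=1, B" into its entries. */
    static List<String> splitEntries(String headerValue) {
        List<String> entries = new ArrayList<>();
        for (String entry : headerValue.split(",")) {
            if (!entry.trim().isEmpty()) {
                entries.add(entry.trim());
            }
        }
        return entries;
    }

    /** Maps a type name to a Class; primitives need special handling. */
    static Class<?> resolveType(String name) throws ClassNotFoundException {
        switch (name) {
            case "int": return int.class;
            case "long": return long.class;
            case "boolean": return boolean.class;
            case "double": return double.class;
            default: return Class.forName(name);
        }
    }

    /** Converts the raw string value to the declared parameter type. */
    static Object convert(Class<?> type, String raw) {
        if (type == int.class) return Integer.parseInt(raw);
        if (type == long.class) return Long.parseLong(raw);
        if (type == boolean.class) return Boolean.parseBoolean(raw);
        if (type == double.class) return Double.parseDouble(raw);
        if (type == String.class) return raw;
        throw new IllegalArgumentException("Unsupported parameter type: " + type.getName());
    }

    /** Builds a handler from one entry: <class name> *[; <java type> = <value>]. */
    static ContentHandler instantiate(String entry) throws ReflectiveOperationException {
        String[] parts = entry.split(";");
        Class<?> cls = Class.forName(parts[0].trim());
        // Refuse anything that is not a ContentHandler before constructing it.
        if (!ContentHandler.class.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(cls.getName() + " is not a ContentHandler");
        }
        Class<?>[] types = new Class<?>[parts.length - 1];
        Object[] args = new Object[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Malformed parameter: " + parts[i]);
            }
            types[i - 1] = resolveType(kv[0].trim());
            args[i - 1] = convert(types[i - 1], kv[1].trim());
        }
        // getConstructor only finds public constructors with exactly these types.
        Constructor<?> ctor = cls.getConstructor(types);
        return (ContentHandler) ctor.newInstance(args);
    }
}
```

With this sketch, an entry like "org.xml.sax.helpers.DefaultHandler" would use the no-argument constructor, while each "<java type> = <value>" pair selects a constructor by its declared parameter types, in order. Since the class name comes from the client, a real implementation would probably also need to restrict which classes can be loaded (for example, to a whitelist or a default package).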
Thanks a lot,
Giuseppe

[0] https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#forName-java.lang.String-
[1] https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#getConstructor-java.lang.Class...-
[3] https://www.w3.org/Protocols/rfc822/

On Fri, Oct 6, 2017 at 3:06 PM, Sergey Beryozkin <[email protected]> wrote:

> Konstantin, by the way, if you are interested in having a good discussion
> to do with using the serialized lambdas then you will be welcome to comment
> on the relevant text in the Tika Concerns Beam thread, though may be Beam
> knows how to take care of the issues you raised...
>
> Thanks, Sergey
>
> On 06/10/17 18:27, Sergey Beryozkin wrote:
>
>> On 06/10/17 18:08, Konstantin Gribov wrote:
>>
>>> My +1 to this idea.
>>>
>>> IMHO, second option is more flexible. I also like Nick's suggestion about
>>> using default package for handlers and interpret dot-separated string as
>>> fqcn. Solr does similar thing and it's very convenient to use (but they use
>>> prefix `solr.` for their classes in predefined package and any other is
>>> interpreted as fqcn).
>>>
>>> I'll add that you could allow user to pass several comma-separated handlers
>>> to allow build content-handler stack if user wants to.
>>>
>>> I would disagree with Sergey about serialized lambdas for 2 reasons:
>>> - it's useful only for java-clients;
>>> - it could bring very nasty bugs leading to RCE class vulnerabilities, so
>>> it's very controversial from security PoV.
>>>
>> Sure. I was not actually suggesting to use them in Tika natively, I only
>> referred to it as the alternative mentioned in the context of the Beam
>> integration work
>>
>> Sergey
>>
>>>
>>> On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <[email protected]> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> if I am not wrong, currently you cannot configure a specific ContentHandler
>>>> while using tika-server. I mean that you can configure your own parser [0]
>>>> but you cannot control which ContentHandler the parser leverages to extract
>>>> text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
>>>> StandardsExtractingContentHandler, etc).
>>>> If it is correct, it would be nice to enable the use of specific
>>>> ContentHandlers within tika-server and I would like to discuss how to solve
>>>> this issue generally.
>>>>
>>>> I propose two solutions:
>>>>
>>>> 1. augment the TikaConfig class so that a specific ContentHandler can be
>>>> used in tika-config.xml;
>>>> 2. determine the ContentHandler to use for parsing through HTTP headers,
>>>> for example:
>>>> curl -T filename.pdf http://localhost:9998/meta --header "X-Content-Handler: PhoneExtractingContentHandler"
>>>> This should affect also the TikaResource.java class.
>>>>
>>>> I look forward to having your feedback. I strongly believe that every user
>>>> who wants to use Tika as a service through tika-server and needs to extract
>>>> content and metadata like phone numbers, standard references, etc would be
>>>> very happy.
>>>>
>>>> Thanks a lot,
>>>> Giuseppe
>
> --
> Sergey Beryozkin
>
> Talend Community Coders
> http://coders.talend.com/
