Julien Nioche wrote:
Hi,

I came across the classloader issue that you mentioned, but got everything to work by duplicating the class TikaConfiguration into the package used by my plugin. The tika-core jar goes into the main /lib dir of Nutch, while the tika-parsers jar goes into the lib dir of the plugin. I now have a first version of the Tika plugin which does some very basic text and metadata extraction.

This is confusing. Could you please explain why various Tika parts need to be put in different places? Also, the word "duplication" raises a red flag ...


What shall we do about the HTMLParseFilters? Get the generic TikaParser to create a DOM representation and pass it to the HTMLParseFilters, as is done now? Modify the HTMLParseFilters to use SAX events, so that we can forward them from Tika? Any other suggestions?

The benefit of using a DOM tree in HTMLParseFilters is that it's easier to extract / remove parts of the tree without keeping track of the context. Context tracking is the most complicated part of working with SAX, and it would have to be reimplemented in many plugins ... The downside is of course the memory footprint, but we do limit the max size of the documents elsewhere (in the protocol plugins). So I'd vote to keep using DOM for now.
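
To illustrate the trade-off, here is a minimal sketch using only the JDK's built-in XML APIs. It is not the Nutch HTMLParseFilter interface; the class DomVsSax and its two methods are hypothetical names made up for this example, and the input is assumed to be well-formed XML. Both methods return the text of a document with one kind of element stripped out. The DOM version just removes the subtrees, while the SAX version has to carry a depth counter to know whether it is currently inside a subtree it should ignore.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of Nutch or Tika.
public class DomVsSax {

    // DOM: drop every <skipTag> subtree, then read whatever text is left.
    // Assumes skipTag is not the document's root element.
    static String textViaDom(String xml, String skipTag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList unwanted = doc.getElementsByTagName(skipTag);
        // The NodeList is live, so keep removing the first remaining match.
        while (unwanted.getLength() > 0) {
            Node n = unwanted.item(0);
            n.getParentNode().removeChild(n);
        }
        return doc.getDocumentElement().getTextContent();
    }

    // SAX: the handler itself must remember whether it is inside <skipTag>.
    static String textViaSax(String xml, final String skipTag) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Nesting depth inside a skipped subtree; 0 means "not skipping".
            private int skipDepth = 0;

            @Override
            public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
                if (skipDepth > 0 || qName.equals(skipTag)) {
                    skipDepth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (skipDepth == 0) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body><p>keep</p>"
                + "<script>drop<b>me</b></script><p> this</p></body></html>";
        System.out.println(textViaDom(xml, "script"));
        System.out.println(textViaSax(xml, "script"));
    }
}
```

The depth counter is the smallest possible piece of context. A filter that needs to know, say, which table or which link it is inside would need a stack of open elements, and each plugin would end up maintaining its own.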

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
