Re: Update on Integration with Tika

Andrzej Bialecki Mon, 16 Nov 2009 12:00:50 -0800

Julien Nioche wrote:

Hi,
I came across the classloader issue that you mentioned but got everything to work OK by duplicating the class TikaConfiguration into the package used by my plugin. The lib tika-core goes into the main /lib dir of nutch while tika-parsers jar goes into the lib dir of the plugin. I now have a first version of the Tika plugin which does some very basic text and metadata extraction.

This is confusing. Could you please explain why various Tika parts need to be put in different places? Also, the word "duplication" raises a red flag ...

What shall we do about the HTMLParseFilters? Get the generic TikaParser to create a DOM representation and pass it to the HTMLParseFilters as it is done now? Modify the HTMLParseFilters so that they use SAX events so that we can forward them from Tika? Any other suggestions?

The benefit of using a DOM tree in HTMLParseFilters is that it's easier to extract / remove parts of the tree without keeping track of the context, which is the most complicated part of working with SAX - this context tracking would have to be reimplemented in many plugins ... The downside is of course the memory footprint - but we do limit the max size of the documents elsewhere (in the protocol plugins). So I'd vote to keep using DOM for now.
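
To illustrate the context-tracking point, here is a rough sketch rather than anything taken from Nutch or Tika. It is written in Python with the standard xml.dom.minidom and xml.sax modules only to keep it short; the real filters are Java. The sample document, the "nav" tag and the function names are all made up for the example. Both functions drop every <nav> subtree and return the remaining text.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample input: the <nav> subtree (including its nested <nav>)
# is the part we want to ignore.
DOC = ("<html><body><p>keep this</p>"
       "<nav><p>drop</p><nav>nested</nav></nav>"
       "<p> and this</p></body></html>")


def text_dom(xml_text, skip_tag):
    """DOM approach: detach the unwanted subtrees, then read what is left."""
    doc = minidom.parseString(xml_text)
    # Copy the node list first so that removals do not disturb the iteration.
    for node in list(doc.getElementsByTagName(skip_tag)):
        if node.parentNode is not None:
            node.parentNode.removeChild(node)
    return "".join(_collect_text(doc.documentElement))


def _collect_text(node):
    """Gather the text nodes below `node` in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.extend(_collect_text(child))
    return parts


class SkipHandler(xml.sax.ContentHandler):
    """SAX approach: the handler itself must remember whether it is inside
    a skipped element, and how deeply, because no tree is available."""

    def __init__(self, skip_tag):
        super().__init__()
        self.skip_tag = skip_tag
        self.skip_depth = 0  # the context that DOM code does not need
        self.parts = []

    def startElement(self, name, attrs):
        if name == self.skip_tag:
            self.skip_depth += 1

    def endElement(self, name):
        if name == self.skip_tag:
            self.skip_depth -= 1

    def characters(self, content):
        # Text is kept only while we are outside every skipped element.
        if self.skip_depth == 0:
            self.parts.append(content)


def text_sax(xml_text, skip_tag):
    handler = SkipHandler(skip_tag)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.parts)
```

Both text_dom(DOC, "nav") and text_sax(DOC, "nav") give the same text, but the depth counter in SkipHandler is exactly the kind of bookkeeping that every SAX-based plugin would have to carry on its own, and it grows quickly once a filter needs to know about parents, siblings or attributes of enclosing elements.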


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
