Re: Update on Integration with Tika

Andrzej Bialecki Tue, 17 Nov 2009 04:59:35 -0800

Julien Nioche wrote:

Hi guys,

This is confusing. Could you please explain why various Tika parts
need to be put in different places?


NUTCH_HOME/lib : tika-core.jar
NUTCH_HOME/tika-plugin/lib : tika-parsers.jar

Since the core uses Tika only for its MIME type functionality, we only
need to put tika-core at the main lib level, whereas the tika plugin
needs tika-parsers.jar plus all the jars used internally by Tika. Note
that we could simply put everything in the main lib dir, but that
would not be very elegant. Is that clearer?


Clear, thanks.

Also, the word "duplication"  raises a red flag ...


First let me explain the classloader issue. The main class in the Tika
plugin instantiates a TikaConfig object (using Tika's XML
configuration file), which tries to load the parser classes for each
mime-type Tika knows about. Remember that we need tika-core in the
main lib directory? This is where the TikaConfig class is stored. For
some reason it is not able to find the classes in the jars located at
the plugin level even though the class instantiating TikaConfig is
itself at the plugin level. I had a look at the PluginClassLoader but
could not find anything wrong with it.
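
One possible reading of this symptom, offered as an assumption rather than a confirmed diagnosis: in Java, a class resolves names through the loader that defined it, and a parent loader never delegates downward to its children. If TikaConfig (defined by the core loader) looks up parser classes through its own loader, the jars at the plugin level would be invisible to it. The toy model below only imitates parent-first delegation; ToyLoader and example.HypotheticalPdfParser are made-up names, and this is not Nutch's PluginClassLoader.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of parent-first class loading (illustration only). */
class ToyLoader {
    private final String name;
    private final ToyLoader parent;                 // null for the top-level loader
    private final Set<String> ownClasses = new HashSet<>();

    ToyLoader(String name, ToyLoader parent, String... classes) {
        this.name = name;
        this.parent = parent;
        for (String c : classes) {
            ownClasses.add(c);
        }
    }

    /** Returns the name of the loader that defines the class, or null if it is not visible. */
    String resolve(String className) {
        // Parent first: ask upward before looking at our own jars.
        if (parent != null) {
            String found = parent.resolve(className);
            if (found != null) {
                return found;
            }
        }
        // There is no step that asks a child loader: visibility only flows upward.
        return ownClasses.contains(className) ? name : null;
    }
}

class DelegationDemo {
    public static void main(String[] args) {
        // "core" stands for NUTCH_HOME/lib, "plugin" for the plugin's own lib directory.
        ToyLoader core = new ToyLoader("core", null, "org.apache.tika.config.TikaConfig");
        ToyLoader plugin = new ToyLoader("plugin", core, "example.HypotheticalPdfParser");

        // Plugin code can see TikaConfig through its parent.
        System.out.println(plugin.resolve("org.apache.tika.config.TikaConfig")); // core
        // Code defined by the core loader cannot see plugin-level classes.
        System.out.println(core.resolve("example.HypotheticalPdfParser"));       // null
    }
}
```

Under this reading, the usual remedies are either to keep the looking-up class and the looked-up classes under one loader, or to have the lookup use a loader passed in by the caller.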

We can of course try to fix this classloader issue (which would be a
more elegant solution), but in order not to get bogged down with this
I found that having a temporary solution with a local TikaConfig
allowed us to make progress with the Tika implementation.

Is the classloader problem clear? Shall we treat it as a separate issue?


Thanks for the explanation.

Well ... let's consider this: in the past we used to put things under
/lib/ when they were being used by more than a few plugins. Then we
started using library-only plugins (e.g. lib-xml, lib-nekohtml, etc).
There is a mechanism that allows us to export any classes from a plugin
so that they are visible to the rest of the framework.

It looks to me like we could be better off by putting all parts of Tika
in a single plugin, and then in Nutch core use a new extension point
just for the purpose of mimetype detection. This facade (MimeDetectors)
would use the Tika plugin if available, or some other (null?) mechanism
otherwise. At the same time Tika would be happy to configure itself
having all tika-core and parsers available under the same classloader,
and it would define two extension points - one for mimetype detection,
and another for parsing. What do you think?
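
To make the facade idea concrete, here is a minimal sketch in plain Java. Only the name MimeDetectors comes from the proposal above; the MimeDetector interface, its detect signature, NullMimeDetector, and the choice of "application/octet-stream" as the fallback answer are all assumptions for illustration, and none of this uses the actual Nutch extension point API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical extension point for MIME type detection. */
interface MimeDetector {
    /** Returns a MIME type, or null if this detector cannot decide. */
    String detect(String url, byte[] content);
}

/** Fallback used when no detector plugin can answer (assumed behaviour). */
class NullMimeDetector implements MimeDetector {
    public String detect(String url, byte[] content) {
        return "application/octet-stream";
    }
}

/** Facade the core would call instead of depending on tika-core directly. */
class MimeDetectors {
    private final List<MimeDetector> detectors = new ArrayList<>();
    private final MimeDetector fallback = new NullMimeDetector();

    /** In a real framework this would be driven by plugin discovery. */
    void register(MimeDetector detector) {
        detectors.add(detector);
    }

    String detect(String url, byte[] content) {
        // Ask each registered detector in turn; the first non-null answer wins.
        for (MimeDetector d : detectors) {
            String type = d.detect(url, content);
            if (type != null) {
                return type;
            }
        }
        return fallback.detect(url, content);
    }
}

class MimeDetectorsDemo {
    public static void main(String[] args) {
        MimeDetectors facade = new MimeDetectors();
        // Stand-in for a detector contributed by the Tika plugin.
        facade.register((url, content) -> url.endsWith(".pdf") ? "application/pdf" : null);
        System.out.println(facade.detect("http://example.com/a.pdf", new byte[0]));
        System.out.println(facade.detect("http://example.com/a.bin", new byte[0]));
    }
}
```

With this shape the core depends only on the small interface, so all Tika jars can live inside the plugin and share one classloader.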

The benefit of using DOM tree in HTMLParseFilters is that it's
easier to extract / remove parts of the tree without keeping track
of the context, which is the most complicated part of working with
SAX - this context tracking would have to be reimplemented in many
plugins ... The downside is of course the memory footprint - but we
do limit the max size of the documents elsewhere (in the protocol
plugins). So I'd vote to keep using DOM for now.
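
As a small illustration of the point about context, the sketch below removes every div subtree, nested ones included, using only the JDK's DOM API. The class name, method name, and sample markup are made up for the example. A SAX version of the same filter would need something like a depth counter to know when the outermost div has closed, which is the context tracking mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomPruneDemo {
    /** Parses well-formed XHTML, drops all elements named tagName, returns the remaining text. */
    static String textWithout(String xhtml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList hits = doc.getElementsByTagName(tagName);
            // The NodeList is live: removing a node shrinks it, so always take item 0.
            while (hits.getLength() > 0) {
                Node n = hits.item(0);
                n.getParentNode().removeChild(n);
            }
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><div>menu <div>submenu</div></div>"
                + "<p>article text</p></body></html>";
        System.out.println(textWithout(xhtml, "div")); // article text
    }
}
```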

With web mining, you absolutely need to be able to access the context
of the complete DOM.


I agree with you both. Maybe we could delegate the building of the DOM
object to the class HTMLParseFilters so that it is done only if there
are HTMLParseFilter implementations to be used.

Hmmmm .. let's rephrase the question: who would use the non-DOM
interface? If nobody, then maybe we only need the DOM.

A related question is : shall we build the DOM representation from the
original HTML or from the XHTML returned by Tika? I would be inclined
to the latter as this could potentially allow us to do the same with
non HTML documents as well as Tika converts their original markup into
XHTML.


I think we should use the XHTML from Tika.
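
Since Tika parsers report their XHTML as SAX events to a ContentHandler, one way to get a DOM from that output is to hand Tika a handler that builds the tree. The sketch below uses only JDK classes: an identity TransformerHandler writing into a DOMResult. The emitXhtml method is a hypothetical stand-in that fires a few events by hand where a real Tika parser would be; it is not Tika code, and this is not necessarily how the plugin will end up doing it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SaxToDomDemo {
    /** Hypothetical stand-in for a parser that emits XHTML as SAX events. */
    static void emitXhtml(ContentHandler h) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        char[] text = "hello".toCharArray();
        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "body", "body", none);
        h.startElement("", "p", "p", none);
        h.characters(text, 0, text.length);
        h.endElement("", "p", "p");
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    /** Collects the SAX events into a DOM Document. */
    static Document buildDom() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            // An identity handler: every SAX event it receives is appended to doc.
            TransformerHandler handler = factory.newTransformerHandler();
            handler.setResult(new DOMResult(doc));
            emitXhtml(handler);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = buildDom();
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Because the tree is built from the event stream rather than from the raw HTML, the same code path would apply to any format Tika can turn into XHTML.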

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
