Re: Update on Integration with Tika

Julien Nioche Tue, 17 Nov 2009 01:24:51 -0800

Hi guys,

>> This is confusing. Could you please explain why various Tika parts
>> need to be put in different places?


NUTCH_HOME/lib : tika-core.jar
NUTCH_HOME/tika-plugin/lib : tika-parsers.jar

The Nutch core uses Tika only for its MIME type detection, so only
tika-core needs to sit at the main lib level. The Tika plugin, on the
other hand, needs tika-parsers.jar plus all the jars used internally
by Tika. Note that we could simply put everything in the main lib dir,
but that would not be very elegant. Is that clearer?

>> Also, the word "duplication"  raises a red flag ...

First let me explain the classloader issue. The main class in the Tika
plugin instantiates a TikaConfig object (using Tika's XML
configuration file), which tries to load the parser classes for each
mime-type Tika knows about. Remember that we need tika-core in the
main lib directory? This is where the TikaConfig class is stored. For
some reason it is not able to find the classes in the jars located at
the plugin level even though the class instantiating TikaConfig is
itself at the plugin level. I had a look at the PluginClassLoader but
could not find anything wrong with it.
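
To make the discussion concrete: in the standard delegation model a
class loader asks its parent but never its children, and a plain
Class.forName(name) resolves through the loader that defined the
calling class. If (and this is an assumption I have not verified)
TikaConfig looks up the parser classes that way, it would only ever
search the main lib loader. A minimal sketch of the usual workaround
pattern, resolving names through the thread context class loader, is
below. The class and method names are hypothetical and are not part of
Nutch or Tika.

```java
import java.util.concurrent.Callable;

class ContextLoaderSketch {

    /**
     * Resolve a class by name through the thread context class loader,
     * falling back to the loader of this class when none is set.
     */
    static Class<?> loadViaContext(String className) throws ClassNotFoundException {
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        ClassLoader loader = (context != null)
                ? context
                : ContextLoaderSketch.class.getClassLoader();
        return Class.forName(className, true, loader);
    }

    /**
     * Run an action with the given loader installed as the context loader
     * (e.g. the plugin's loader), restoring the previous one afterwards.
     */
    static <T> T withContextLoader(ClassLoader loader, Callable<T> action) throws Exception {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return action.call();
        } finally {
            thread.setContextClassLoader(previous);
        }
    }
}
```

This only helps if the code doing the lookup actually consults the
context loader, which is exactly the part that would need checking.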

We can of course try to fix this classloader issue (which would be the
more elegant solution), but in order not to get bogged down with it, I
found that a temporary solution with a local TikaConfig allowed us to
make progress with the Tika implementation.

Is the classloader problem clear? Shall we treat it as a separate issue?


>> The benefit of using DOM tree in HTMLParseFilters is that it's
>> easier to extract / remove parts of the tree without keeping track
>> of the context, which is the most complicated part of working with
>> SAX - this context tracking would have to be reimplemented in many
>> plugins ... The downside is of course the memory footprint - but we
>> do limit the max size of the documents elsewhere (in the protocol
>> plugins). So I'd vote to keep using DOM for now.
>
> With web mining, you absolutely need to be able to access the context
> of the complete DOM.

I agree with you both. Maybe we could delegate the building of the DOM
object to the class HTMLParseFilters so that it is done only if there
are HTMLParseFilter implementations to be used.
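
A rough sketch of that idea, using only JDK classes: the DOM is built
through a supplier that is called only when at least one filter is
registered. DomFilter, buildDom and runFilters are hypothetical names
for illustration and do not match the actual Nutch interfaces.

```java
import java.io.StringReader;
import java.util.List;
import java.util.function.Supplier;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LazyDomSketch {

    /** Hypothetical stand-in for a DOM-based parse filter. */
    interface DomFilter {
        void filter(Document dom);
    }

    /** Number of times the DOM was actually built; for illustration only. */
    static int domBuilds = 0;

    /** Build a DOM from a well-formed XHTML string. */
    static Document buildDom(String xhtml) {
        try {
            domBuilds++;
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not build DOM", e);
        }
    }

    /** Build the DOM once, and only if there is at least one filter to run. */
    static void runFilters(List<DomFilter> filters, Supplier<Document> domSupplier) {
        if (filters.isEmpty()) {
            return; // no filters configured: skip the cost of the DOM entirely
        }
        Document dom = domSupplier.get();
        for (DomFilter filter : filters) {
            filter.filter(dom);
        }
    }
}
```

With this shape the caller passes something like
() -> buildDom(content), and documents parsed without any filter
configured never pay the memory cost of the tree.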

A related question: shall we build the DOM representation from the
original HTML or from the XHTML returned by Tika? I would be inclined
towards the latter, as this could potentially allow us to do the same
with non-HTML documents as well, since Tika converts their original
markup into XHTML.
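
For the second option, here is a minimal sketch of turning a stream of
SAX events into a DOM with plain JAXP: an identity TransformerHandler
writing into a DOMResult. To keep it self-contained the events come
from a JDK SAX parser reading an XHTML string; in the plugin the same
handler would presumably be passed to Tika's parser as its
ContentHandler, which is an assumption and not shown here. The class
and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDomSketch {

    /** Collect the SAX events produced for an XHTML string into a DOM Document. */
    static Document domFromSax(String xhtml) throws Exception {
        // Identity handler: a ContentHandler that copies SAX events to a Result.
        SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = stf.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);

        // Stand-in event source; any producer of SAX events would do.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));

        // With an empty DOMResult the handler creates a new Document.
        return (Document) result.getNode();
    }
}
```

The DOM-based filters would then see the same kind of tree whatever
the original format was, as long as the XHTML coming out of Tika keeps
enough of the structure they rely on.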

Have a nice day

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com
