Hi guys,

>> This is confusing. Could you please explain why various Tika parts
>> need to be put in different places?

NUTCH_HOME/lib : tika-core.jar
NUTCH_HOME/tika-plugin/lib : tika-parsers.jar

Since the core uses Tika only for its MIME type functionality, we only need to put tika-core at the main lib level, whereas the Tika plugin needs tika-parsers.jar plus all the jars used internally by Tika. Note that we could simply put everything in the main lib dir, but that would not be very elegant. Is that clearer?

>> Also, the word "duplication" raises a red flag ...

First let me explain the classloader issue. The main class in the Tika plugin instantiates a TikaConfig object (using Tika's XML configuration file), which tries to load the parser classes for each MIME type Tika knows about. Remember that we need tika-core in the main lib directory? That is where the TikaConfig class lives. For some reason it is not able to find the classes in the jars located at the plugin level, even though the class instantiating TikaConfig is itself at the plugin level. I had a look at the PluginClassLoader but could not find anything wrong with it.

We can of course try to fix this classloader issue (which would be the more elegant solution), but in order not to get bogged down in it, I found that a temporary workaround with a local TikaConfig allowed us to make progress with the Tika implementation. Is the classloader problem clear? Shall we treat it as a separate issue?

>> The benefit of using DOM tree in HTMLParseFilters is that it's
>> easier to extract / remove parts of the tree without keeping track
>> of the context, which is the most complicated part of working with
>> SAX - this context tracking would have to be reimplemented in many
>> plugins ... The downside is of course the memory footprint - but we
>> do limit the max size of the documents elsewhere (in the protocol
>> plugins). So I'd vote to keep using DOM for now.
>
> With web mining, you absolutely need to be able to access the context
> of the complete DOM.

I agree with you both.

Maybe we could delegate the building of the DOM object to the HTMLParseFilters class, so that it is built only if there are HTMLParseFilter implementations to run.

A related question: shall we build the DOM representation from the original HTML or from the XHTML returned by Tika? I would be inclined towards the latter, as it could potentially allow us to do the same with non-HTML documents, since Tika converts their original markup into XHTML.

Have a nice day

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
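
P.S. To make the DOM idea a bit more concrete, here is a rough sketch of what I have in mind. It is not the actual plugin code. It uses only the JDK: a TransformerHandler collects SAX events into a DOM, and the DOM is built only when at least one filter is registered. The names LazyDomSketch, toDom, domIfNeeded and filterCount are made up for illustration. In the plugin, I am assuming the SAX events would come from the Tika parser rather than from the plain JDK XMLReader used here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper, for illustration only (not part of Nutch or Tika).
final class LazyDomSketch {

    /** Turns a stream of SAX events (here read from an XHTML string) into a DOM. */
    public static Document toDom(String xhtml) throws Exception {
        // Empty document that will receive the nodes.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        // Identity transform: a ContentHandler that copies SAX events into the DOM.
        SAXTransformerFactory tf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler();
        handler.setResult(new DOMResult(doc));

        // Stand-in SAX source; assumption: a parser emitting XHTML SAX events
        // would be handed the same handler instead.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return doc;
    }

    /** Builds the DOM only when there is at least one filter that needs it. */
    public static Document domIfNeeded(String xhtml, int filterCount) throws Exception {
        return filterCount == 0 ? null : toDom(xhtml);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>hello</p></body></html>";
        System.out.println(domIfNeeded(xhtml, 0) == null);
        System.out.println(domIfNeeded(xhtml, 1).getDocumentElement().getNodeName());
    }
}
```

With no filters registered we never pay the memory cost of the DOM, and because the input is just SAX events, the same path should work for any document type that ends up as XHTML.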