Tika parser
-----------

                 Key: NUTCH-766
                 URL: https://issues.apache.org/jira/browse/NUTCH-766
             Project: Nutch
          Issue Type: New Feature
            Reporter: Julien Nioche


Tika handles many different formats under the bonnet and exposes them 
nicely via SAX events. What is described here is a tika-parser plugin which 
delegates the parsing to Tika but can still coexist with the existing 
parsing plugins, which is useful for formats that Tika handles only partially 
(or not at all). Some of the elements below have already been discussed on the 
mailing lists. Note that this is work in progress; your feedback is welcome.

Tika is already used by Nutch for its MimeType implementations. Tika comes as 
different jar files (core and parsers). In the work described here we decided 
to put the libs in 2 different places:
NUTCH_HOME/lib : tika-core.jar
NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
Since the core uses Tika only for its MimeType functionality, we only need 
to put tika-core at the main lib level, whereas the tika plugin needs 
tika-parsers.jar plus all the jars used internally by Tika.

Due to limitations in the way Tika loads its classes, we had to duplicate the 
TikaConfig class in the tika-plugin. This might be fixed in the future in Tika 
itself or avoided by refactoring the mimetype part of Nutch using extension 
points.

Unlike most other parsers, Tika handles more than one mime-type, which is why 
we are using "*" as its mimetype value in the plugin descriptor and have 
modified ParserFactory.java so that it considers the tika parser as potentially 
suitable for all mime-types. In practice this means that the associations 
between a mime type and a parser plugin as defined in parse-plugins.xml are 
useful only for the cases where we want to handle a mime type with a different 
parser than Tika.
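
To illustrate the lookup order described above, here is a minimal sketch in 
Java. It is not the actual ParserFactory code: the class name, the method names 
and the plugin identifiers ("tika-parser", "parse-html") are placeholders chosen 
for this example. It only shows the idea that an explicit association wins and 
the "*" entry is used for everything else.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a mime-type lookup with a "*" fallback. */
public class WildcardParserLookup {

    static final String WILDCARD = "*";

    // mime type -> parser plugin id (ids here are illustrative only)
    private final Map<String, String> parserByMimeType = new HashMap<>();

    /** Associates a mime type (or "*") with a parser plugin id. */
    public void register(String mimeType, String parserId) {
        parserByMimeType.put(mimeType, parserId);
    }

    /**
     * Returns the parser explicitly associated with the mime type, or the
     * wildcard parser if there is no explicit association. Returns null when
     * neither is registered.
     */
    public String resolve(String mimeType) {
        String explicit = parserByMimeType.get(mimeType);
        if (explicit != null) {
            return explicit;
        }
        return parserByMimeType.get(WILDCARD);
    }

    public static void main(String[] args) {
        WildcardParserLookup lookup = new WildcardParserLookup();
        lookup.register(WILDCARD, "tika-parser");
        lookup.register("text/html", "parse-html");

        // Explicit association is preferred over the wildcard.
        System.out.println(lookup.resolve("text/html"));
        // No explicit association, so the wildcard parser is used.
        System.out.println(lookup.resolve("application/pdf"));
    }
}
```

With such a scheme, an explicit entry is needed only for the mime types that 
should not go to Tika.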

The general approach I chose was to convert the SAX events returned by the Tika 
parsers into DOM objects and reuse the utilities that come with the current 
HTML parser, i.e. link detection and metatag handling. It also means that we 
can use the HTMLParseFilters in exactly the same way. The main difference 
though is that HTMLParseFilters are not limited to HTML documents anymore, as 
the XHTML tags returned by Tika can correspond to a different format for the 
original document. There is a duplication of code with the html-plugin which 
will be resolved by either a) getting rid of the html-plugin altogether or 
b) exporting its jar and making the tika parser depend on it.
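
As a rough illustration of the SAX-to-DOM idea, the sketch below uses only the 
standard JAXP classes shipped with the JDK. It is not the plugin's code: the 
class and method names are made up for the example, and the SAX events are 
emitted by hand where a Tika parser would normally produce them. An identity 
TransformerHandler collects the events into a DOMResult, and the resulting 
Document is then walked for links the way a DOM-based utility could do it.

```java
import java.util.ArrayList;
import java.util.List;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical sketch: collect SAX events into a DOM tree, then read links. */
public class SaxToDomSketch {

    /** Builds a DOM Document from SAX events via an identity TransformerHandler. */
    public static Document buildDom() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);
        // Assumption: a real parser would be handed this handler instead.
        emitSampleEvents(handler);
        return (Document) result.getNode();
    }

    /** Stand-in for a parser: emits a tiny XHTML-like document as SAX events. */
    static void emitSampleEvents(ContentHandler h) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://example.org/");

        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "body", "body", none);
        h.startElement("", "a", "a", href);
        char[] text = "example".toCharArray();
        h.characters(text, 0, text.length);
        h.endElement("", "a", "a");
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    /** Collects the href value of every <a> element that has one. */
    public static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildDom();
        System.out.println(extractLinks(doc));
    }
}
```

Once the events are in a DOM tree, anything written against org.w3c.dom can be 
reused regardless of what the original document format was.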

The following libraries are required in the lib/ directory of the tika-parser : 

      <library name="asm-3.1.jar"/>
      <library name="bcmail-jdk15-144.jar"/>
      <library name="commons-compress-1.0.jar"/>
      <library name="commons-logging-1.1.1.jar"/>
      <library name="dom4j-1.6.1.jar"/>
      <library name="fontbox-0.8.0-incubator.jar"/>
      <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
      <library name="hamcrest-core-1.1.jar"/>
      <library name="jce-jdk13-144.jar"/>
      <library name="jempbox-0.8.0-incubator.jar"/>
      <library name="metadata-extractor-2.4.0-beta-1.jar"/>
      <library name="mockito-core-1.7.jar"/>
      <library name="objenesis-1.0.jar"/>
      <library name="ooxml-schemas-1.0.jar"/>
      <library name="pdfbox-0.8.0-incubating.jar"/>
      <library name="poi-3.5-FINAL.jar"/>
      <library name="poi-ooxml-3.5-FINAL.jar"/>
      <library name="poi-scratchpad-3.5-FINAL.jar"/>
      <library name="tagsoup-1.2.jar"/>
      <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
      <library name="xml-apis-1.0.b2.jar"/>
      <library name="xmlbeans-2.3.0.jar"/>

There is a small test suite which needs to be improved. We will need to have a 
look at each individual format and check that it is covered by Tika and, if so, 
to the same extent as by the existing parsers; the Wiki is probably the right 
place for this. The language identifier (which is an HTMLParseFilter) seemed to 
work fine.
 
Again, your comments are welcome. Please bear in mind that this is just a first 
step. 

Julien
http://www.digitalpebble.com





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.