[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832255#action_12832255 ]
Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

{quote}
+1 to commit this...
{quote}

Awesome, Andrzej. Will do so tonight, PST, if I don't hear any objections between now and then...

Thanks!

Cheers,
Chris

> Tika parser
> -----------
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
> Issue Type: New Feature
> Reporter: Julien Nioche
> Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>
> Tika handles a lot of different formats under the bonnet and exposes them
> nicely via SAX events. What is described here is a tika-parser plugin which
> delegates parsing to Tika but can still coexist with the existing parsing
> plugins, which is useful for formats that Tika handles only partially (or
> not at all). Some of the elements below have already been discussed on the
> mailing lists. Note that this is work in progress; your feedback is welcome.
>
> Tika is already used by Nutch for its MimeType implementations. Tika comes as
> different jar files (core and parsers); in the work described here we decided
> to put the libs in two different places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since the core uses Tika only for its MimeType functionality, we only need to
> put tika-core at the main lib level, whereas the tika plugin obviously needs
> tika-parsers.jar plus all the jars used internally by Tika.
>
> Due to limitations in the way Tika loads its classes, we had to duplicate the
> TikaConfig class in the tika-plugin. This might be fixed in the future in
> Tika itself or avoided by refactoring the mimetype part of Nutch using
> extension points.
>
> Unlike most other parsers, Tika handles more than one mime type, which is why
> we are using "*" as its mimetype value in the plugin descriptor and have
> modified ParserFactory.java so that it considers the tika parser as
> potentially suitable for all mime types. In practice this means that the
> associations between a mime type and a parser plugin as defined in
> parse-plugins.xml are useful only for the cases where we want to handle a
> mime type with a different parser than Tika.
>
> The general approach I chose was to convert the SAX events returned by the
> Tika parsers into DOM objects and reuse the utilities that come with the
> current HTML parser (i.e. link detection and metatag handling). This also
> means that we can use the HTMLParseFilters in exactly the same way. The main
> difference, though, is that HTMLParseFilters are not limited to HTML documents
> anymore, as the XHTML tags returned by Tika can correspond to a different
> format for the original document. There is a duplication of code with the
> html-plugin, which will be resolved by either a) getting rid of the
> html-plugin altogether or b) exporting its jar and making the tika parser
> depend on it.
>
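The SAX-to-DOM step described above can be sketched with JDK classes alone. This is only an illustration under stated assumptions, not the plugin's code: the class and method names (SaxToDomSketch, toDom, extractLinks) are hypothetical, and the JDK's own XML parser stands in for a Tika parser as the source of XHTML SAX events. The attached patch may well build its DOM differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical sketch: all names here are illustrative, not taken from the patch.
class SaxToDomSketch {

    // Collects SAX events into a DOM tree via an identity TransformerHandler.
    // Assumption: the JDK XML parser plays the role of the SAX event producer;
    // in the plugin the events would come from a Tika parser instead.
    public static Document toDom(String xhtml) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);

        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));

        Node node = result.getNode();
        return node instanceof Document ? (Document) node : node.getOwnerDocument();
    }

    // A DOM-based utility in the spirit of link detection: collects href values
    // of <a> elements, whatever the original document format was.
    public static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom(
                "<html><body><p>See <a href=\"http://example.org/\">this</a></p></body></html>");
        System.out.println(extractLinks(doc));
    }
}
```

Once the events are in a DOM tree, any DOM-based utility can run over it without knowing the original format, which is the property that lets HTMLParseFilters apply to non-HTML documents.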
> The following libraries are required in the lib/ directory of the
> tika-parser:
> <library name="asm-3.1.jar"/>
> <library name="bcmail-jdk15-144.jar"/>
> <library name="commons-compress-1.0.jar"/>
> <library name="commons-logging-1.1.1.jar"/>
> <library name="dom4j-1.6.1.jar"/>
> <library name="fontbox-0.8.0-incubator.jar"/>
> <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
> <library name="hamcrest-core-1.1.jar"/>
> <library name="jce-jdk13-144.jar"/>
> <library name="jempbox-0.8.0-incubator.jar"/>
> <library name="metadata-extractor-2.4.0-beta-1.jar"/>
> <library name="mockito-core-1.7.jar"/>
> <library name="objenesis-1.0.jar"/>
> <library name="ooxml-schemas-1.0.jar"/>
> <library name="pdfbox-0.8.0-incubating.jar"/>
> <library name="poi-3.5-FINAL.jar"/>
> <library name="poi-ooxml-3.5-FINAL.jar"/>
> <library name="poi-scratchpad-3.5-FINAL.jar"/>
> <library name="tagsoup-1.2.jar"/>
> <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
> <library name="xml-apis-1.0.b2.jar"/>
> <library name="xmlbeans-2.3.0.jar"/>
>
> There is a small test suite which needs to be improved. We will need to have
> a look at each individual format and check that it is covered by Tika and, if
> so, to the same extent; the Wiki is probably the right place for this. The
> language identifier (which is an HTMLParseFilter) seemed to work fine.
>
> Again, your comments are welcome. Please bear in mind that this is just a
> first step.
> Julien
> http://www.digitalpebble.com

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.