[
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798727#action_12798727
]
Julien Nioche commented on NUTCH-766:
-------------------------------------
Hi Chris,
No worries, I'd rather wait for you to have a look at it. It's quite a big
change and it would be better if someone else had a look at it. Being the
author I might miss something obvious.
Thanks
J.
> Tika parser
> -----------
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
> Issue Type: New Feature
> Reporter: Julien Nioche
> Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them
> nicely via SAX events. What is described here is a tika-parser plugin which
> delegates the parsing to Tika but can still coexist with the existing
> parsing plugins, which is useful for formats only partially handled by Tika
> (or not at all). Some of the elements below have already been discussed on
> the mailing lists. Note that this is work in progress; your feedback is
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as
> different jar files (core and parsers); in the work described here we decided
> to put the libs in 2 different places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since Tika is used by the core only for its MimeType functionality, we only
> need to put tika-core at the main lib level, whereas the tika plugin needs
> tika-parsers.jar plus all the jars used internally by Tika.
> Due to limitations in the way Tika loads its classes, we had to duplicate the
> TikaConfig class in the tika-plugin. This might be fixed in the future in
> Tika itself or avoided by refactoring the mimetype part of Nutch using
> extension points.
> Unlike most other parsers, Tika handles more than one mime type, which is why
> we are using "*" as its mime type value in the plugin descriptor and have
> modified ParserFactory.java so that it considers the tika parser as
> potentially suitable for all mime types. In practice this means that the
> associations between a mime type and a parser plugin as defined in
> parse-plugins.xml are useful only for the cases where we want to handle a
> mime type with a different parser than Tika.
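To illustrate the lookup behaviour described above, here is a minimal sketch. It is not the actual ParserFactory code from the patch: the class WildcardParserLookup, its methods and the plugin ids "html-plugin" and "tika-parser" are all hypothetical. It assumes that explicit mappings (as in parse-plugins.xml) are tried first and that any plugin registered with "*" is appended as a candidate for every mime type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the patched ParserFactory.
public class WildcardParserLookup {
    // Stand-in for the mime type -> plugin id associations of parse-plugins.xml.
    private final Map<String, List<String>> explicit = new LinkedHashMap<>();
    // Ids of plugins that declared "*" as their mime type.
    private final List<String> wildcard = new ArrayList<>();

    // Record an explicit association between a mime type and a plugin id.
    public void map(String mimeType, String pluginId) {
        explicit.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(pluginId);
    }

    // Record a plugin that is potentially suitable for all mime types.
    public void registerWildcard(String pluginId) {
        wildcard.add(pluginId);
    }

    // Explicitly mapped plugins come first; wildcard plugins follow (assumed order).
    public List<String> candidatesFor(String mimeType) {
        List<String> mapped = explicit.getOrDefault(mimeType, Collections.emptyList());
        List<String> result = new ArrayList<>(mapped);
        for (String id : wildcard) {
            if (!result.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        WildcardParserLookup lookup = new WildcardParserLookup();
        lookup.map("text/html", "html-plugin");
        lookup.registerWildcard("tika-parser");
        System.out.println(lookup.candidatesFor("text/html"));
        System.out.println(lookup.candidatesFor("application/pdf"));
    }
}
```

With this ordering, an explicit mapping is only needed for a mime type that should be tried with another parser before Tika; every unmapped type falls through to the wildcard plugin.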
> The general approach I chose was to convert the SAX events returned by the
> Tika parsers into DOM objects and reuse the utilities that come with the
> current HTML parser, i.e. link detection and metatag handling. It also means
> that we can use the HTMLParseFilters in exactly the same way. The main
> difference though is that HTMLParseFilters are not limited to HTML documents
> anymore, as the XHTML tags returned by Tika can correspond to a different
> format for the original document. There is a duplication of code with the
> html-plugin which will be resolved by either a) getting rid of the
> html-plugin altogether or b) exporting its jar and making the tika parser
> depend on it.
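As a rough illustration of the SAX-to-DOM idea, the sketch below uses only standard JDK classes and does not reproduce the patch. The class SaxToDomSketch and its methods are hypothetical, and the hand-written events in emitSample stand in for what a Tika parser would send to a ContentHandler. An identity TransformerHandler writes those events into a DOM Document, which is then walked for links as DOM-based utilities would do.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: SAX events -> DOM -> link extraction.
public class SaxToDomSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Stand-in for a parser: emits a tiny XHTML document as SAX events.
    static void emitSample(ContentHandler h) throws Exception {
        h.startDocument();
        h.startElement(XHTML, "html", "html", new AttributesImpl());
        h.startElement(XHTML, "body", "body", new AttributesImpl());
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", "http://example.com/");
        h.startElement(XHTML, "a", "a", attrs);
        char[] text = "example".toCharArray();
        h.characters(text, 0, text.length);
        h.endElement(XHTML, "a", "a");
        h.endElement(XHTML, "body", "body");
        h.endElement(XHTML, "html", "html");
        h.endDocument();
    }

    // Collects the SAX events into a DOM Document via an identity transformer.
    static Document buildDom() throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().newDocument();
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.setResult(new DOMResult(doc));
        emitSample(handler);
        return doc;
    }

    // DOM-based link detection, independent of the original document format.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }

    // Convenience wrapper that hides the checked exceptions.
    static List<String> sampleLinks() {
        try {
            return extractLinks(buildDom());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleLinks());
    }
}
```

Because the extraction step only sees a DOM tree, the same code applies whether the events came from an HTML page or from another format that the parser renders as XHTML.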
> The following libraries are required in the lib/ directory of the tika-parser:
> <library name="asm-3.1.jar"/>
> <library name="bcmail-jdk15-144.jar"/>
> <library name="commons-compress-1.0.jar"/>
> <library name="commons-logging-1.1.1.jar"/>
> <library name="dom4j-1.6.1.jar"/>
> <library name="fontbox-0.8.0-incubator.jar"/>
> <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
> <library name="hamcrest-core-1.1.jar"/>
> <library name="jce-jdk13-144.jar"/>
> <library name="jempbox-0.8.0-incubator.jar"/>
> <library name="metadata-extractor-2.4.0-beta-1.jar"/>
> <library name="mockito-core-1.7.jar"/>
> <library name="objenesis-1.0.jar"/>
> <library name="ooxml-schemas-1.0.jar"/>
> <library name="pdfbox-0.8.0-incubating.jar"/>
> <library name="poi-3.5-FINAL.jar"/>
> <library name="poi-ooxml-3.5-FINAL.jar"/>
> <library name="poi-scratchpad-3.5-FINAL.jar"/>
> <library name="tagsoup-1.2.jar"/>
> <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
> <library name="xml-apis-1.0.b2.jar"/>
> <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to have
> a look at each individual format and check that it is covered by Tika and,
> if so, to the same extent as by the existing plugins; the Wiki is probably
> the right place for this. The language identifier (which is an
> HTMLParseFilter) seemed to work fine.
>
> Again, your comments are welcome. Please bear in mind that this is just a
> first step.
> Julien
> http://www.digitalpebble.com