[jira] Commented: (NUTCH-766) Tika parser

Julien Nioche (JIRA) Mon, 11 Jan 2010 08:59:24 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798727#action_12798727
 ]


Julien Nioche commented on NUTCH-766:
-------------------------------------

Hi Chris, 

No worries, I'd rather wait for you to have a look at it. It's quite a big 
change and it would be better if someone else reviewed it. Being the 
author, I might miss something obvious.

Thanks

J.

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates parsing to Tika but can still coexist with the existing parsing 
> plugins, which is useful for formats that Tika handles only partially (or 
> not at all). Some of the elements below have already been discussed on the 
> mailing lists. Note that this is work in progress; your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers); in the work described here we decided 
> to put the libs in 2 different places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since the core uses Tika only for its MimeType functionality, we only need 
> to put tika-core at the main lib level, whereas the tika plugin needs 
> tika-parsers.jar plus all the jars used internally by Tika.
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one mime type, which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika.
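
The lookup behaviour described in the paragraph above could be sketched roughly as follows. This is only an illustration under assumptions, not the code from the attached ParserFactory patch: the class WildcardParserLookup, its methods and the plugin ids "parse-html" and "parse-tika" are hypothetical names, and placing explicit mappings ahead of the wildcard parser is an assumption drawn from the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not the actual Nutch ParserFactory.
final class WildcardParserLookup {
    // Mime type value a catch-all parser declares in its descriptor.
    static final String WILDCARD = "*";

    // Explicit mime type -> parser id associations (the parse-plugins.xml role).
    private final Map<String, List<String>> explicit = new LinkedHashMap<>();
    // Parsers registered for "*", i.e. potentially suitable for any mime type.
    private final List<String> wildcardParsers = new ArrayList<>();

    public void register(String mimeType, String parserId) {
        if (WILDCARD.equals(mimeType)) {
            wildcardParsers.add(parserId);
        } else {
            explicit.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(parserId);
        }
    }

    // Explicitly mapped parsers come first; wildcard parsers follow as fallback
    // (assumed ordering).
    public List<String> candidatesFor(String mimeType) {
        List<String> result = new ArrayList<>(
                explicit.getOrDefault(mimeType, Collections.<String>emptyList()));
        for (String parserId : wildcardParsers) {
            if (!result.contains(parserId)) {
                result.add(parserId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        WildcardParserLookup lookup = new WildcardParserLookup();
        lookup.register("text/html", "parse-html");
        lookup.register("*", "parse-tika");
        System.out.println(lookup.candidatesFor("text/html"));
        System.out.println(lookup.candidatesFor("application/pdf"));
    }
}
```

With this ordering, an explicit association only matters when a mime type should go to a parser other than the wildcard one, which matches the remark about parse-plugins.xml.
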
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser, i.e. link detection and metatag handling. This also 
> means that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore, as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and making the tika parser 
> depend on it.
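
As a rough illustration of the SAX-to-DOM idea described above, the sketch below builds a DOM tree from SAX events and then walks it for links. It uses only JDK classes and makes several assumptions: DomBuildingHandler is a hypothetical, simplified handler (it ignores namespaces), emitSampleXhtml stands in for a Tika parser by firing hand-written events, and extractLinks is a stand-in for the existing HTML parser utilities rather than the real ones.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: not the code from the attached patch.
final class SaxToDomSketch {

    // Collects SAX events into a DOM Document (namespaces ignored for brevity).
    static final class DomBuildingHandler extends DefaultHandler {
        private final Document document;
        private final Deque<Node> stack = new ArrayDeque<>();

        DomBuildingHandler() throws ParserConfigurationException {
            document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            stack.push(document);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            Element element = document.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            stack.peek().appendChild(element);
            stack.push(element);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            stack.pop();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text directly under the Document node is not allowed in DOM; skip it.
            if (stack.peek() != document) {
                stack.peek().appendChild(document.createTextNode(new String(ch, start, length)));
            }
        }

        Document getDocument() {
            return document;
        }
    }

    // Stand-in for a Tika parser: emits a tiny XHTML document as SAX events.
    static void emitSampleXhtml(ContentHandler handler) throws SAXException {
        String ns = "http://www.w3.org/1999/xhtml";
        handler.startDocument();
        handler.startElement(ns, "html", "html", new AttributesImpl());
        handler.startElement(ns, "body", "body", new AttributesImpl());
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", "http://example.com/");
        handler.startElement(ns, "a", "a", attrs);
        char[] text = "Example".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement(ns, "a", "a");
        handler.endElement(ns, "body", "body");
        handler.endElement(ns, "html", "html");
        handler.endDocument();
    }

    // Simplified link detection over the DOM: collects href values of <a> elements.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        DomBuildingHandler handler = new DomBuildingHandler();
        emitSampleXhtml(handler);
        System.out.println(extractLinks(handler.getDocument()));
    }
}
```

Because the DOM-walking step only sees XHTML-like elements, the same code runs whatever the original document format was, which is the point made above about HTMLParseFilters no longer being limited to HTML.
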
> The following libraries are required in the lib/ directory of the 
> tika-parser plugin:
>       <library name="asm-3.1.jar"/>
>       <library name="bcmail-jdk15-144.jar"/>
>       <library name="commons-compress-1.0.jar"/>
>       <library name="commons-logging-1.1.1.jar"/>
>       <library name="dom4j-1.6.1.jar"/>
>       <library name="fontbox-0.8.0-incubator.jar"/>
>       <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>       <library name="hamcrest-core-1.1.jar"/>
>       <library name="jce-jdk13-144.jar"/>
>       <library name="jempbox-0.8.0-incubator.jar"/>
>       <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>       <library name="mockito-core-1.7.jar"/>
>       <library name="objenesis-1.0.jar"/>
>       <library name="ooxml-schemas-1.0.jar"/>
>       <library name="pdfbox-0.8.0-incubating.jar"/>
>       <library name="poi-3.5-FINAL.jar"/>
>       <library name="poi-ooxml-3.5-FINAL.jar"/>
>       <library name="poi-scratchpad-3.5-FINAL.jar"/>
>       <library name="tagsoup-1.2.jar"/>
>       <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>       <library name="xml-apis-1.0.b2.jar"/>
>       <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and, 
> if so, to the same extent as the existing parser; the Wiki is probably the 
> right place for this. The language identifier (which is an HTMLParseFilter) 
> seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
