[jira] Commented: (NUTCH-766) Tika parser

Julien Nioche (JIRA) Thu, 11 Feb 2010 03:41:03 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454
 ]


Julien Nioche commented on NUTCH-766:
-------------------------------------

@Chris : I just did a fresh co from svn, applied the patch v3 and unzipped 
sample.tar.gz onto  the directory parse-tika and ran the test just as you did 
but could not reproduce the problem.  Could there be a difference between your 
version and the trunk?

@Sami :  

{quote} was there a reason not to use AutoDetect parser?  {quote} 
I suppose we could as long we give it a clue about the MimeType obtained from 
the Content.  As you pointed out, there could be a duplication with the 
detection done by Mime-Util. I suppose one way to do would be to add a new 
version of the method getParse(Content conte, MimeType type). That's an 
interesting point.

{quote} Also was there a reson not to parse html wtih tika?  {quote} 
It is supposed to do so, if it does not then it's a bug which needs urgent 
fixing.

Regarding parsing package formats, I think the plan is that Tika will handle 
that in the future but we could try to do that now if we find a relatively 
clean mechanism for doing so. BTW could you please send a diff and not the full 
code of the class you posted earlier, that would make the comparison much 
easier.




> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the pasring mechanism of Tika but can still coexist with the 
> existing parsing plugins which is useful for formats partially handled by 
> Tika (or not at all). Some of the elements below have already been discussed 
> on the mailing lists. Note that this is work in progress, your feedback is 
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers), in the work described here we decided 
> to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only 
> need to put tika-core at the main lib level whereas the tika plugin obviously 
> needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser i.e. link detection,  metatag handling but also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference though is that HTMLParseFilters are not limited to HTML documents 
> anymore as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and make the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser 
> : 
>       <library name="asm-3.1.jar"/>
>       <library name="bcmail-jdk15-144.jar"/>
>       <library name="commons-compress-1.0.jar"/>
>       <library name="commons-logging-1.1.1.jar"/>
>       <library name="dom4j-1.6.1.jar"/>
>       <library name="fontbox-0.8.0-incubator.jar"/>
>       <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>       <library name="hamcrest-core-1.1.jar"/>
>       <library name="jce-jdk13-144.jar"/>
>       <library name="jempbox-0.8.0-incubator.jar"/>
>       <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>       <library name="mockito-core-1.7.jar"/>
>       <library name="objenesis-1.0.jar"/>
>       <library name="ooxml-schemas-1.0.jar"/>
>       <library name="pdfbox-0.8.0-incubating.jar"/>
>       <library name="poi-3.5-FINAL.jar"/>
>       <library name="poi-ooxml-3.5-FINAL.jar"/>
>       <library name="poi-scratchpad-3.5-FINAL.jar"/>
>       <library name="tagsoup-1.2.jar"/>
>       <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>       <library name="xml-apis-1.0.b2.jar"/>
>       <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and if 
> so to the same extent; the Wiki is probably the right place for this. The 
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-766) Tika parser

Reply via email to