[jira] Issue Comment Edited: (NUTCH-766) Tika parser

Julien Nioche (JIRA) Thu, 11 Feb 2010 09:23:04 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564
 ]


Julien Nioche edited comment on NUTCH-766 at 2/11/10 5:22 PM:
--------------------------------------------------------------

I had a closer look at the HTML parsing issue. The association between the 
mime-type and the parser implementation is not explicitly set in 
parse-plugins.xml, so the ParserFactory goes through all the plugins and 
collects the ones with a matching mimetype (or * for Tika). The Tika parser 
takes no precedence over the default HTML parser, so the latter comes first in 
the list and is used for parsing.

Of course that does not happen if parse-html is not specified in 
plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't 
think we want to have to specify explicitly that Tika should be used in every 
mapping; explicit mappings should be reserved for the cases where a parser 
must be used instead of Tika.

What we could do, though, is this: when no explicit mapping is set for a 
mimetype, Tika (or any parser marked as supporting any mimetype) is put first 
in the list of discovered parsers. It would then remain the default choice 
unless an explicit mapping is set, even if another plugin is loaded and can 
handle the type.

Makes sense?

The ParserFactory section of patch v3 can be replaced by:

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===================================================================
--- src/java/org/apache/nutch/parse/ParserFactory.java  (revision 909059)
+++ src/java/org/apache/nutch/parse/ParserFactory.java  (working copy)
@@ -348,11 +348,23 @@
                 contentType)) {
           extList.add(extensions[i]);
         }
+        else if ("*".equals(extensions[i].getAttribute("contentType"))){
+          // default plugins get the priority
+          extList.add(0, extensions[i]);
+        }
       }
       
       if (extList.size() > 0) {
         if (LOG.isInfoEnabled()) {
-          LOG.info("The parsing plugins: " + extList +
+          StringBuffer extensionsIDs = new StringBuffer("[");
+          boolean isFirst = true;
+          for (Extension ext : extList){
+            if (!isFirst) extensionsIDs.append(" - ");
+            else isFirst=false;
+            extensionsIDs.append(ext.getId());
+          }
+          extensionsIDs.append("]");
+          LOG.info("The parsing plugins: " + extensionsIDs.toString() +
                    " are enabled via the plugin.includes system " +
                    "property, and all claim to support the content type " +
                    contentType + ", but they are not mapped to it  in the " +
@@ -369,7 +381,7 @@
 
   private boolean match(Extension extension, String id, String type) {
     return ((id.equals(extension.getId())) &&
-            (type.equals(extension.getAttribute("contentType")) ||
+            (type.equals(extension.getAttribute("contentType")) || "*".equals(extension.getAttribute("contentType")) ||
              type.equals(DEFAULT_PLUGIN)));
   }
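
To illustrate the ordering this change is aiming for, here is a minimal, self-contained sketch. It does not use the Nutch classes: Ext is a hypothetical stand-in for Extension, the plugin ids are made up for the example, and candidates() only mirrors the loop touched by the first hunk, for the case where no explicit mapping exists in parse-plugins.xml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParserOrderingSketch {

  /** Hypothetical stand-in for a Nutch Extension: an id plus its declared contentType. */
  static final class Ext {
    final String id;
    final String contentType;

    Ext(String id, String contentType) {
      this.id = id;
      this.contentType = contentType;
    }
  }

  /**
   * Returns the ids of the parsers that could handle contentType when no
   * explicit mapping exists. Exact matches keep their discovery order;
   * wildcard ("*") parsers are moved to the front so they become the default.
   */
  static List<String> candidates(List<Ext> loaded, String contentType) {
    List<String> ids = new ArrayList<String>();
    for (Ext ext : loaded) {
      if (contentType.equals(ext.contentType)) {
        ids.add(ext.id);
      } else if ("*".equals(ext.contentType)) {
        // default plugins get the priority
        ids.add(0, ext.id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    // Assumed set of loaded plugins; the ids are illustrative only.
    List<Ext> loaded = Arrays.asList(
        new Ext("parse-html", "text/html"),
        new Ext("parse-tika", "*"),
        new Ext("parse-pdf", "application/pdf"));
    System.out.println(candidates(loaded, "text/html")); // prints [parse-tika, parse-html]
    System.out.println(candidates(loaded, "image/png")); // prints [parse-tika]
  }
}
```

With this ordering the wildcard parser is tried first even though parse-html is loaded and claims text/html, which is the behaviour described above. An explicit mapping would still override it, but that lookup is not modelled here.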
   



> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
> sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them 
> nicely via SAX events. What is described here is a tika-parser plugin which 
> delegates the parsing to Tika but can still coexist with the existing 
> parsing plugins, which is useful for formats that Tika handles only 
> partially (or not at all). Some of the elements below have already been 
> discussed on the mailing lists. Note that this is work in progress; your 
> feedback is welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as 
> different jar files (core and parsers); in the work described here we decided 
> to put the libs in 2 different places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since the core uses Tika only for its mimetype functionality, we only need 
> to put tika-core at the main lib level, whereas the tika plugin obviously 
> needs tika-parsers.jar plus all the jars used internally by Tika.
> Due to limitations in the way Tika loads its classes, we had to duplicate the 
> TikaConfig class in the tika-plugin. This might be fixed in the future in 
> Tika itself or avoided by refactoring the mimetype part of Nutch using 
> extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why 
> we are using "*" as its mimetype value in the plugin descriptor and have 
> modified ParserFactory.java so that it considers the tika parser as 
> potentially suitable for all mime-types. In practice this means that the 
> associations between a mime type and a parser plugin as defined in 
> parse-plugins.xml are useful only for the cases where we want to handle a 
> mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the 
> Tika parsers into DOM objects and reuse the utilities that come with the 
> current HTML parser, i.e. link detection and metatag handling. It also means 
> that we can use the HTMLParseFilters in exactly the same way. The main 
> difference, though, is that HTMLParseFilters are not limited to HTML documents 
> anymore, as the XHTML tags returned by Tika can correspond to a different 
> format for the original document. There is a duplication of code with the 
> html-plugin which will be resolved by either a) getting rid of the 
> html-plugin altogether or b) exporting its jar and making the tika parser 
> depend on it.
> The following libraries are required in the lib/ directory of the tika-parser:
>       <library name="asm-3.1.jar"/>
>       <library name="bcmail-jdk15-144.jar"/>
>       <library name="commons-compress-1.0.jar"/>
>       <library name="commons-logging-1.1.1.jar"/>
>       <library name="dom4j-1.6.1.jar"/>
>       <library name="fontbox-0.8.0-incubator.jar"/>
>       <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>       <library name="hamcrest-core-1.1.jar"/>
>       <library name="jce-jdk13-144.jar"/>
>       <library name="jempbox-0.8.0-incubator.jar"/>
>       <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>       <library name="mockito-core-1.7.jar"/>
>       <library name="objenesis-1.0.jar"/>
>       <library name="ooxml-schemas-1.0.jar"/>
>       <library name="pdfbox-0.8.0-incubating.jar"/>
>       <library name="poi-3.5-FINAL.jar"/>
>       <library name="poi-ooxml-3.5-FINAL.jar"/>
>       <library name="poi-scratchpad-3.5-FINAL.jar"/>
>       <library name="tagsoup-1.2.jar"/>
>       <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>       <library name="xml-apis-1.0.b2.jar"/>
>       <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to have 
> a look at each individual format and check that it is covered by Tika and, 
> if so, to the same extent as the existing parser; the Wiki is probably the 
> right place for this. The language identifier (which is an HTMLParseFilter) 
> seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a 
> first step. 
> Julien
> http://www.digitalpebble.com
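
The SAX-to-DOM step mentioned in the description can be sketched with the JDK alone. This is not the plugin's code: it assumes an identity TransformerHandler writing into a DOMResult, and a plain JAXP SAX parser reading an XHTML string stands in for the SAX events that a Tika parser would emit. The class and method names (SaxToDomSketch, toDom, links) are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDomSketch {

  /** Feeds SAX events into an identity TransformerHandler that builds a DOM tree. */
  static Document toDom(String xhtml) {
    try {
      Document doc =
          DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      SAXTransformerFactory factory =
          (SAXTransformerFactory) TransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      handler.setResult(new DOMResult(doc));

      // Stand-in event source: in the plugin the events would come from Tika.
      SAXParserFactory spf = SAXParserFactory.newInstance();
      spf.setNamespaceAware(true);
      XMLReader reader = spf.newSAXParser().getXMLReader();
      reader.setContentHandler(handler);
      reader.parse(new InputSource(new StringReader(xhtml)));
      return doc;
    } catch (Exception e) {
      throw new IllegalStateException("could not build DOM from SAX events", e);
    }
  }

  /** Simplified link detection: collects the href of every <a> element in the tree. */
  static List<String> links(Document doc) {
    List<String> hrefs = new ArrayList<String>();
    NodeList anchors = doc.getElementsByTagName("a");
    for (int i = 0; i < anchors.getLength(); i++) {
      Element anchor = (Element) anchors.item(i);
      if (anchor.hasAttribute("href")) {
        hrefs.add(anchor.getAttribute("href"));
      }
    }
    return hrefs;
  }
}
```

Once the events are materialised as a DOM tree, DOM-based utilities such as link detection or the HTMLParseFilters can work on it regardless of the format of the original document.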

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
