Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by ChrisMattmann:
http://wiki.apache.org/nutch/ParserFactoryImprovementProposal

------------------------------------------------------------------------------
  = Parser Factory Improvement Proposal =
  
+ Jerome Charron <[EMAIL PROTECTED]>, 
+ Sébastien Le Callonnec <[EMAIL PROTECTED]>, 
+ Chris A. Mattmann <[EMAIL PROTECTED]>
+ 
+ Wednesday, September 14th, 2005
+ 
+ '''DRAFT'''
  
  == Summary of Issue ==
  Currently Nutch provides a plugin mechanism wherein plugins register certain 
metadata about themselves, including their id, classname, and so forth. In 
particular, the set of parsing plugins register which contentTypes and file 
suffixes they can support with a PluginRepository.
@@ -20, +27 @@

  We propose that the “plugin preference list” should be a separate file 
that lives in $NUTCH_HOME/conf called “parse-plugins.xml”. The format of 
the file (full DTD to be developed during coding) should be something like: {{{
  
  <parse-plugins>
-   <default pluginname=”parse-text”/>
-   <fileType name=”powerpoint”>
-    <mimeTypes>
-     <mimeType name=”application/pdf” />
-     <mimeType name=”application/x-pdf” />
-     …
-    </mimeTypes>
  
-    <plugins>
+   <mime-type name="*">
+       <plugin id="parse-text" order="1"/>
+       <plugin id="another-one-default-parser" order="2"/>
+      ....
+   </mime-type>
+   
+   <mime-type name="application/vnd.ms-powerpoint">
+    <!-- if no order attribute is specified, then the order of the plugin elements is significant -->
+     <plugin id="parse-mspowerpoint"/>
+   </mime-type>
  
+   <mime-type name="application/pdf">
-       <plugin name=”parse-pdf” order=”1”/>
+     <plugin id="parse-pdf" order="1"/>
-       <plugin name=”parse-pdf-worse” order=”2”/>
+     <plugin id="parse-pdf-worse" order="2" />
+   </mime-type>
+   ....
-      …
-    </plugins>
-   </fileType>
-     …
  </parse-plugins>
  
  }}}
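
As a rough illustration of how such a preference list could be consumed, the following sketch reads the file content into a map from mime type name to plugin ids in preference order. The class name ParsePluginsReader and its read method are hypothetical and not part of Nutch. The sketch assumes the `mime-type` element and `id` attribute spelling used in the PDF and PowerPoint entries, and it assumes that a missing `order` attribute falls back to the position in the file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the proposed parse-plugins.xml; not an existing Nutch class. */
public class ParsePluginsReader {

    /** Returns mime type name -> plugin ids in preference order. */
    public static Map<String, List<String>> read(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse parse-plugins.xml", e);
        }
        Map<String, List<String>> preferences = new LinkedHashMap<>();
        NodeList mimeTypes = doc.getElementsByTagName("mime-type");
        for (int i = 0; i < mimeTypes.getLength(); i++) {
            Element mimeType = (Element) mimeTypes.item(i);
            NodeList plugins = mimeType.getElementsByTagName("plugin");
            // Pair each plugin id with its rank. Assumption: a missing
            // "order" attribute falls back to the position in the file.
            List<Object[]> entries = new ArrayList<>();
            for (int j = 0; j < plugins.getLength(); j++) {
                Element plugin = (Element) plugins.item(j);
                String order = plugin.getAttribute("order");
                int rank = order.isEmpty() ? j + 1 : Integer.parseInt(order);
                entries.add(new Object[] {rank, plugin.getAttribute("id")});
            }
            // List.sort is stable, so equal ranks keep their file order.
            entries.sort((a, b) -> Integer.compare((Integer) a[0], (Integer) b[0]));
            List<String> ids = new ArrayList<>();
            for (Object[] entry : entries) {
                ids.add((String) entry[1]);
            }
            preferences.put(mimeType.getAttribute("name"), ids);
        }
        return preferences;
    }
}
```

A caller would look up the content type first and fall back to the `*` entry when no specific entry exists; that fallback is left out of the sketch.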
  
+ === Activating Parse Plugins ===
+ If an activated parse plugin is not listed in the parse-plugins.xml, then it 
won’t get called for parsing. The purpose of the parse-plugins.xml file would 
be to map parsing-plugin to contentType. Therefore, if an activated plugin is 
not mapped to a content type, then it is “activated”, but won’t get 
called. This is very similar to Apache HTTPD. See below:
  
- One of the main impacts of having a file like parse-plugins.xml is that no 
longer should the pathSuffix="" be part of the plugin.xml descriptor. We 
propose to move that out of plugin.xml and into the mime-types.xml file.
+ {{{
+ # httpd.conf example
+ # add handler for php
+ 
+ LoadModule php4_module        libexec/httpd/libphp4.so
+ 
+ # map handler to mimeType
+ AddType application/x-httpd-php .php
+ AddType application/x-httpd-php-source .phps
+ 
+ AddHandler php-script   php
+ AddHandler php-script   phps
+ }}}
+ There are two different levels in the above example. First, the plugin is 
“activated” in the LoadModule section. Then, the plugin is “mapped” to 
a content type in the AddType and AddHandler section. We believe that this is 
the way to go. Apache HTTPD is pervasive, and its model is well understood by 
many of the same folks who would want to use Nutch. Although we realize that 
this is a change from the way that Nutch currently works, and that people 
don’t like change, we believe that this change is necessary and that Nutch 
should adopt it.
+ 
+ === Maintaining consistency between parse-plugins.xml and nutch-default.xml 
activated plugins ===
+ An interesting question arises in the following two examples:
+ 
+  *No plugin defined in parse-plugins for a specified content-type, but many 
activated plugins that can deal with this content-type.
+  *Many plugins defined in the parse-plugins for a specified content-type, but 
with the same priority
+ 
+ Unfortunately, this is erroneous user input, which we as developers cannot 
elegantly prevent. We propose a simple way to handle it: if the user specifies 
multiple parse plugins with the same priority for a mimeType, then 
LOG.severe(), and exit. This is no different from what other systems do with 
bogus user input. For instance, in Apache HTTPD, if a user specifies that .cgi 
files should be handled by a text-handler, ''and'' by a perl-handler, Apache 
HTTPD will log an error message and exit, which we believe is the correct 
thing to do in that case. Users of the Nutch system will need to examine the 
parse-plugins.xml file and ensure that they don’t set the same priority for 
two different parse plugins for a particular mimeType. We propose to note this 
in a comment in the parse-plugins.xml file, and also to note it as a major 
change in the Nutch installation process.
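
A minimal sketch of the duplicate-priority check described above, under stated assumptions: the class PriorityValidator and its methods are hypothetical names, the input is a map from mime type to the list of order values found in parse-plugins.xml, and an IllegalStateException stands in for the proposed "exit" so that the behaviour can be observed by a caller.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

/** Hypothetical validator for parse-plugins.xml priorities; not an existing Nutch class. */
public class PriorityValidator {

    private static final Logger LOG = Logger.getLogger(PriorityValidator.class.getName());

    /** Returns the mime types for which two or more plugins share an order value. */
    public static List<String> findConflicts(Map<String, List<Integer>> ordersByMimeType) {
        List<String> conflicts = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : ordersByMimeType.entrySet()) {
            Set<Integer> seen = new HashSet<>();
            for (Integer order : entry.getValue()) {
                // Set.add returns false when the value was already present.
                if (!seen.add(order)) {
                    conflicts.add(entry.getKey());
                    break;
                }
            }
        }
        return conflicts;
    }

    /** Logs every conflict at SEVERE level and then aborts. */
    public static void validate(Map<String, List<Integer>> ordersByMimeType) {
        List<String> conflicts = findConflicts(ordersByMimeType);
        for (String mimeType : conflicts) {
            LOG.severe("Multiple parse plugins share a priority for mime type " + mimeType);
        }
        if (!conflicts.isEmpty()) {
            // Assumption: an exception replaces the proposed exit.
            throw new IllegalStateException("Conflicting priorities in parse-plugins.xml");
        }
    }
}
```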
+ 
+ === Path Suffix Attribute in plugin.xml files and erroneous mime types 
returned by web servers for files ===
+ Another main impact of having a file like parse-plugins.xml is that the 
pathSuffix="" attribute should no longer be part of the plugin.xml descriptor. 
We propose to move it out of plugin.xml and into the mime-types.xml file. 
Additionally, we can "kill two birds with one stone" here and handle a 
frequently occurring problem that users experience with Nutch: erroneous mime 
types returned by web servers for particular files. Specifically, we propose 
to add a MimeType alias mapper to the mime-types.xml file that will allow us 
to associate the standard IANA mime types with the non-standard mime types 
that web servers return. These two proposed changes to mime-types.xml would 
look like the following:
+ 
+ {{{
+ 
+ <!-- mime-types.xml file -->
+   <mime-type name="application/vnd.ms-powerpoint">
+     <!-- pathSuffix lives here now -->
+       <ext>ppt</ext>
+       <magic offset="....." type="..." value="..."/>
+ 
+     <!-- here are other mime types that are not the default IANA mime types, 
but still returned by servers -->
+       <mapped-type name="application/powerpoint"/>
+       <mapped-type name="application/mspowerpoint"/>
+    </mime-type>
+  
+ }}}
+ 
+ To handle this mapping, we propose adding two new methods to the MimeType 
Java class: {{{public static MimeType map(MimeType);}}} and 
{{{public static MimeType map(String);}}}. Both would resolve the mappings 
defined in the mime-types.xml file.
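
The alias lookup behind the proposed map(String) method could be sketched as follows. To stay self-contained, the sketch works on plain strings instead of the MimeType class, and the name MimeTypeAliasMapper, the addMappedType method, the case-insensitive comparison, and the pass-through of unknown types are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical alias table for mime types; not an existing Nutch class. */
public class MimeTypeAliasMapper {

    // Maps a non-standard name (a mapped-type entry) to its standard name.
    private final Map<String, String> canonicalByAlias = new HashMap<>();

    /** Registers one mapped-type entry for a standard mime type. */
    public void addMappedType(String canonical, String alias) {
        canonicalByAlias.put(normalize(alias), normalize(canonical));
    }

    /** Returns the standard name for a reported type, or the normalized input if unknown. */
    public String map(String reported) {
        if (reported == null) {
            return null;
        }
        String key = normalize(reported);
        String canonical = canonicalByAlias.get(key);
        return canonical != null ? canonical : key;
    }

    // Assumption: mime type names are compared case-insensitively.
    private static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}
```

With the PowerPoint entry from the mime-types.xml example registered, a server-reported application/mspowerpoint would resolve to application/vnd.ms-powerpoint.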
+ 
+ 
+ === Proposal Task Summary ===
+ To summarize, our proposal to improve the parser factory consists of the 
following tasks:
+ 
+  1. Provide a mime-type mapper (based on IANA) in the util.mime package. 
Implementation to be refined: uses an extension of the existing mime-type.dtd
+  2. Define a schema for the parse-plugins.xml file
+  3. Deprecate the pathSuffix from the plugin.xml file
+  4. ParserFactory must check the content-type used in the parse-plugins.xml 
file against the content-type(s) specified in the plugin.xml. If they match, 
the plugin is used; if not, the plugin is not used.
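
The check in task 4 could look roughly like the sketch below. The class ParsePluginFilter and its method are hypothetical. The sketch assumes the preference list from parse-plugins.xml and the content types declared in each plugin.xml have already been loaded into plain collections, and that the declared map contains only activated plugins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for the ParserFactory check; not an existing Nutch class. */
public class ParsePluginFilter {

    /**
     * Keeps, in preference order, the plugins whose plugin.xml declares the
     * given content type. Plugins that do not declare it are not used.
     */
    public static List<String> usablePlugins(String contentType,
                                             List<String> preferredIds,
                                             Map<String, Set<String>> declaredTypesById) {
        List<String> usable = new ArrayList<>();
        for (String id : preferredIds) {
            Set<String> declared = declaredTypesById.get(id);
            // A plugin missing from the map is treated as not activated.
            if (declared != null && declared.contains(contentType)) {
                usable.add(id);
            }
        }
        return usable;
    }
}
```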
  
  == Architectural Impact ==
  
@@ -49, +109 @@

   *Fetcher
   *PluginSystem
   *ParserFactory
+  *MimeTypeSystem
  
  === Impact on current releases of Nutch ===
  
  ''Incompatibilities''
  
- By moving the contentType and pathSuffix out of the plugin.xml file, this 
would create an updated version of the plugin.xml descriptor schema for each 
plugin. To lessen the effect on previous and near-term releases of Nutch this 
information could be left as an option in the plugin.xml schema, but marked as 
“deprecated” to let people know that this functionality isn’t part of the 
parse plugin identification process anymore, but it is left in the schema so as 
not to create incompatibilities with the plugin.xml files that people have 
already wrote. However, ultimately in future releases of Nutch, we propose that 
the contentType and pathSuffix attributes should be removed from the plugin.xml 
schema.
+ Moving the pathSuffix out of the plugin.xml file and into the 
mime-types.xml file would create an updated version of the plugin.xml 
descriptor schema for each plugin, along with an updated mime-types.xml 
descriptor schema. Additionally, storing the mime type aliases in the 
mime-types.xml file will require an addition to the mime-types.xml schema. 
To lessen the effect on previous and near-term releases of Nutch, the 
pathSuffix attribute could be left as an option in the plugin.xml schema but 
marked as “deprecated”. This lets people know that it is no longer part of 
the parse plugin identification process, while avoiding incompatibilities with 
the plugin.xml files that people have already written. Ultimately, however, we 
propose that the pathSuffix attribute be removed from the plugin.xml schema in 
future releases of Nutch.
  
- Other than the plugin.xml file schema change, this capability addition will 
simply control the order in which parsing plugins get called during fetching 
activities. It won’t directly impact the segments stored, or the webapp, or 
any of the main components of Nutch.
+ The proposed capability addition will simply control the order in which 
parsing plugins get called during fetching activities. It won’t directly 
impact the segments stored, or the webapp. It will only affect the fetcher 
component, and the mime types component.
  
  ''Issues''
  
  The proposed new capabilities should be first tested on local systems, and if 
successful, uploaded to JIRA, and verified against the latest SVNs.
- Unit tests should be written to verify appropriate plugin parsing order.
- Users will need to be notified in the Nutch tutorial and instruction lists 
about how to set up the parsing plugin preferences prior to performing a fetch.
+ Unit tests should be written to verify appropriate plugin parsing order. 
Users will need to be notified in the Nutch tutorial and instruction lists 
about how to set up the parsing plugin preferences prior to performing a fetch.
  
  == Personnel ==
  
@@ -72, +132 @@

  
  == Timeframe ==
  
-  *Begin work the weekend of 9/9
+  *Begin work the weekend of 9/16
-  *Complete first prototype patches to JIRA by end of week, 9/18
+  *Complete first prototype patches to JIRA by end of week, 9/25
-  *Test against latest SVNs of Nutch, by 9/25
+  *Test against latest SVNs of Nutch, by 10/1
-  *Delivery of operational capability, by 10/1
+  *Delivery of operational capability, by 10/8
  
  == Affected files ==
   *PluginRepository.java
   *PluginManifestParser.java
   *ParserFactory.java
   *plugin.xml descriptor files
+  *mime-types.xml file
+  *addition of parse-plugins.xml file
   *files in package {{{org.apache.nutch.util.mime}}}
  
