The following page has been changed by ChrisMattmann:
http://wiki.apache.org/nutch/ParserFactoryImprovementProposal

The comment on the change is:
Initial Draft of ParserFactoryImprovementProposal

New page:
= Parser Factory Improvement Proposal =


== Summary of Issue ==
Currently Nutch provides a plugin mechanism wherein plugins register certain 
metadata about themselves, including their id, classname, and so forth. In 
particular, parsing plugins register the contentTypes and file suffixes they 
support with a PluginRepository.

One “adopted practice” in current Nutch parsing plugins (committed in 
Subversion; see, e.g., parse-pdf and parse-rss) is for each plugin to verify 
that the content type passed to it during a fetch is indeed one of the 
contentTypes that it supports (application/xml, application/pdf, etc.). This 
practice is cumbersome for a few reasons:

 *Any updates to supported content types for a parsing plugin will require a 
recompilation of the plugin code
 *Checking for “hard coded” content types within the parsing plugin is a 
duplication of information that already exists in the plugin’s descriptor 
file, plugin.xml
 *By the time content reaches a parsing plugin (i.e., the plugin has been 
returned by the ParserFactory and handed content during a fetch), the 
ParserFactory should already have ensured that the appropriate plugin is 
being called for that contentType.
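
For illustration, the hard-coded check described above might look roughly like 
the following Java sketch. The class name, method name, and the exact set of 
types are assumptions made for this example; it is not code taken from any 
existing Nutch plugin.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; this is not code from any Nutch plugin.
class HardCodedTypeCheck {
    // Duplicates what the plugin's plugin.xml descriptor already declares.
    // Supporting a new content type means editing this set and recompiling.
    private static final Set<String> SUPPORTED =
        new HashSet<>(Arrays.asList("application/pdf", "application/x-pdf"));

    // The kind of guard a parsing plugin might run before it starts parsing.
    static boolean accepts(String contentType) {
        return contentType != null && SUPPORTED.contains(contentType);
    }
}
```

Every plugin that follows this practice carries its own copy of such a set, 
which is the duplication the list above objects to.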

A further problem is that several parsing plugins may support many of the same 
content types. For instance, the parse-js plugin may be the only parsing 
plugin well suited to JavaScript, but it may also provide a good enough 
heuristic parser for plain text, and so it may declare support for both types. 
However, there is already a parsing plugin, parse-text, whose primary purpose 
is to parse plain text.

== Suggested Remedy ==
To ensure that the desired parsing plugin is called for each content type, and 
in effect to “kill two birds with one stone”, we propose a parsing plugin 
preference list for each content type that Nutch knows how to handle, i.e., 
each content type available via the mimeType system. 
Therefore, during a fetch, once the appropriate mimeType has been determined 
for content, and the ParserFactory is tasked with returning a parsing plugin, 
the ParserFactory should consult a preference list for that contentType, 
allowing it to determine which plugin has the highest preference for the 
contentType. That parsing plugin should be returned via the ParserFactory to 
the fetcher. If there is any problem using the returned parsing plugin for a 
particular contentType (i.e., if a ParseException is thrown during the parse, 
or a null ParseStatus is returned), then the ParserFactory should be called 
again, this time asking for the “next highest ranked” plugin for that 
contentType. This process should repeat until the parse is successful.
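
The retry behaviour described above can be sketched in Java as follows. This 
is a minimal sketch under stated assumptions: `SketchParser` is a hypothetical 
stand-in for a parsing plugin, a `null` return stands in for a null 
ParseStatus, and returning an empty result once the list is exhausted is an 
assumption, since the proposal does not say what should happen in that case.

```java
import java.util.List;
import java.util.Optional;

class ParseFallbackSketch {
    // Hypothetical stand-in for a parsing plugin (not the Nutch Parser API).
    interface SketchParser {
        // Returns the parse result, or null to signal failure.
        String parse(String content) throws Exception;
    }

    // Tries each parser in preference order and returns the first success.
    static Optional<String> parseWithFallback(List<SketchParser> ranked,
                                              String content) {
        for (SketchParser parser : ranked) {
            try {
                String result = parser.parse(content);
                if (result != null) {
                    return Optional.of(result);
                }
                // Null result: fall through to the next ranked parser.
            } catch (Exception e) {
                // Parse failure: fall through to the next ranked parser.
            }
        }
        // Assumption: give up once every ranked parser has failed.
        return Optional.empty();
    }
}
```

Here the caller holds the whole ranked list; the proposal instead has the 
fetcher ask the ParserFactory again for the next plugin, which yields the same 
order of attempts.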

We propose that the “plugin preference list” should be a separate file that 
lives in $NUTCH_HOME/conf called “parse-plugins.xml”. The format of the 
file (full DTD to be developed during coding) should be something like: {{{

<parse-plugins>
  <default pluginname="parse-text"/>
  <fileType name="pdf">
    <mimeTypes>
      <mimeType name="application/pdf"/>
      <mimeType name="application/x-pdf"/>
      …
    </mimeTypes>

    <plugins>
      <plugin name="parse-pdf" order="1"/>
      <plugin name="parse-pdf-worse" order="2"/>
      …
    </plugins>
  </fileType>
  …
</parse-plugins>

}}}
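
As a rough illustration of how such a file could be consumed, the Java sketch 
below reads the draft format into a map from mime type to an ordered list of 
plugin names, using the JDK's built-in DOM parser. The class and method names 
are hypothetical, the element and attribute names follow the draft above (the 
DTD is still to be developed), and the `default` element is not handled here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical loader for the draft parse-plugins.xml format.
class ParsePluginsSketch {
    // Maps each mime type to its plugin names, lowest "order" value first.
    static Map<String, List<String>> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> prefs = new HashMap<>();
            NodeList fileTypes = doc.getElementsByTagName("fileType");
            for (int i = 0; i < fileTypes.getLength(); i++) {
                Element fileType = (Element) fileTypes.item(i);
                // TreeMap sorts by the numeric "order" attribute.
                // Assumption: order values are unique within a fileType.
                TreeMap<Integer, String> ranked = new TreeMap<>();
                NodeList plugins = fileType.getElementsByTagName("plugin");
                for (int j = 0; j < plugins.getLength(); j++) {
                    Element plugin = (Element) plugins.item(j);
                    ranked.put(Integer.parseInt(plugin.getAttribute("order")),
                               plugin.getAttribute("name"));
                }
                List<String> ordered = new ArrayList<>(ranked.values());
                // Every mime type of this fileType shares the same ranking.
                NodeList mimes = fileType.getElementsByTagName("mimeType");
                for (int j = 0; j < mimes.getLength(); j++) {
                    Element mime = (Element) mimes.item(j);
                    prefs.put(mime.getAttribute("name"), ordered);
                }
            }
            return prefs;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read parse-plugins XML", e);
        }
    }
}
```

A ParserFactory built along these lines could look up the list for a 
contentType and walk it from the first entry onward.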


One of the main impacts of having a file like parse-plugins.xml is that the 
pathSuffix="" attribute should no longer be part of the plugin.xml descriptor. 
We propose to move it out of plugin.xml and into the mime-types.xml file.

== Architectural Impact ==

=== Components ===
 *Fetcher
 *PluginSystem
 *ParserFactory

=== Impact on current releases of Nutch ===

''Incompatibilities''

Moving the contentType and pathSuffix out of the plugin.xml file would require 
an updated version of the plugin.xml descriptor schema for each plugin. To 
lessen the effect on previous and near-term releases of Nutch, these 
attributes could remain optional in the plugin.xml schema but be marked as 
“deprecated”, signalling that they are no longer part of the parse plugin 
identification process while avoiding incompatibilities with the plugin.xml 
files that people have already written. Ultimately, in future releases of 
Nutch, we propose that the contentType and pathSuffix attributes be removed 
from the plugin.xml schema.

Other than the plugin.xml file schema change, this capability addition will 
simply control the order in which parsing plugins get called during fetching 
activities. It won’t directly impact the segments stored, or the webapp, or 
any of the main components of Nutch.

''Issues''

 *The proposed new capabilities should first be tested on local systems and, 
if successful, uploaded to JIRA and verified against the latest SVN revisions.
 *Unit tests should be written to verify the appropriate plugin parsing order.
 *Users will need to be told, in the Nutch tutorial and instruction lists, 
how to set up the parsing plugin preferences prior to performing a fetch.

== Personnel ==

 *Jerome Charron
 *Sébastien Le Callonnec
 *Chris A. Mattmann

== Timeframe ==

 *Begin work the weekend of 9/9
 *Complete first prototype patches to JIRA by end of week, 9/18
 *Test against latest SVNs of Nutch, by 9/25
 *Delivery of operational capability, by 10/1

== Affected files ==
 *PluginRepository.java
 *PluginManifestParser.java
 *ParserFactory.java
 *plugin.xml descriptor files
 *files in package {{{org.apache.nutch.util.mime}}}
