[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12322997 ] 

Chris A. Mattmann commented on NUTCH-88:
----------------------------------------

I'm currently working on writing a proposal for addressing this issue. The 
proposal will include the following information:

* summary of issue

* suggested remedy

* architectural impact

* impact on current releases of Nutch
 - incompatibilities
 - any other issues

* available resources

* timeframe

I hope to have it done by tomorrow afternoon, say 3pm Pacific Standard Time.

Thanks,
  Chris

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>      Fix For: 0.8-dev

>
> The ParserFactory chooses the Parser plugin to use based on the content-types 
> and path-suffixes defined in each parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType" 
> attribute matches the beginning of the content's type is used. 
> If none match, then the first whose "pathSuffix" attribute matches the end of 
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the 
> empty string is used.
> This policy causes problems when no match is found, because an arbitrary 
> parser is used (and there is a good chance that this parser can't handle the 
> content).
> In addition, the content-type associated with a parser plugin is 
> specified in the plugin.xml of each plugin (this is the value used by the 
> ParserFactory), AND each parser also checks in its own code whether the 
> content-type is ok (it uses a hard-coded content-type value rather than 
> the value specified in the plugin.xml => possibility of mismatches between 
> the hard-coded content-type and the content-type declared in plugin.xml).
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
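
The three-step selection policy quoted above can be sketched roughly as follows. This is only an illustration of the described behaviour, not the actual ParserFactory code: the class `ParserSelectionSketch`, the `ParserEntry` type, the `selectParser` and `sampleParsers` methods, and the sample plugin ids are all hypothetical names made up for this sketch. Skipping empty "contentType" and "pathSuffix" values in the first two steps is also an assumption.

```java
import java.util.Arrays;
import java.util.List;

final class ParserSelectionSketch {

    /** Hypothetical stand-in for what one parser plugin declares in its plugin.xml. */
    public static final class ParserEntry {
        public final String id;
        public final String contentType;
        public final String pathSuffix;

        public ParserEntry(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    /** Returns the id of the selected parser, or null if no entry qualifies. */
    public static String selectParser(List<ParserEntry> parsers,
                                      String contentType, String urlPath) {
        // Step 1: content type has priority. The declared contentType must
        // match the beginning of the content's type.
        for (ParserEntry p : parsers) {
            if (!p.contentType.isEmpty() && contentType.startsWith(p.contentType)) {
                return p.id;
            }
        }
        // Step 2: otherwise, the declared pathSuffix must match the end of the URL path.
        for (ParserEntry p : parsers) {
            if (!p.pathSuffix.isEmpty() && urlPath.endsWith(p.pathSuffix)) {
                return p.id;
            }
        }
        // Step 3: otherwise, fall back to the first entry with an empty pathSuffix.
        // This is the step the issue complains about: the fallback parser may be
        // unable to handle the content at all.
        for (ParserEntry p : parsers) {
            if (p.pathSuffix.isEmpty()) {
                return p.id;
            }
        }
        return null;
    }

    /** Made-up plugin declarations, used only for the demonstration below. */
    public static List<ParserEntry> sampleParsers() {
        return Arrays.asList(
            new ParserEntry("parse-html", "text/html", "html"),
            new ParserEntry("parse-pdf", "application/pdf", "pdf"),
            new ParserEntry("parse-text", "text/plain", ""));
    }

    public static void main(String[] args) {
        List<ParserEntry> parsers = sampleParsers();
        // Matched by content type (step 1).
        System.out.println(selectParser(parsers, "text/html; charset=UTF-8", "/index.html"));
        // Matched by path suffix (step 2).
        System.out.println(selectParser(parsers, "application/octet-stream", "/docs/report.pdf"));
        // Nothing matches, so the empty-suffix entry is used (step 3).
        System.out.println(selectParser(parsers, "image/png", "/logo.png"));
    }
}
```

In the last call a PNG image ends up with the plain-text parser, which is the kind of unsuitable fallback the issue description objects to.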

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
