[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323007 ] 

Jerome Charron commented on NUTCH-88:
-------------------------------------

Dawid,
Thanks for your pointers on IE MimeType resolution. Nutch already has a MimeType 
resolver that uses both file extensions and file "magic" sequences to find the 
content-type of a file. It is currently underused, and some enhancements are 
probably needed, such as a content-type mapping that maps a content-type to 
a normalized one (for instance application/powerpoint to 
application/vnd.ms-powerpoint), so that only the normalized version has to be 
registered in the plugin.xml file.
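
To make the content-type mapping idea concrete, here is a minimal sketch in 
Java. It is only an illustration: the class ContentTypeNormalizer and its 
methods are hypothetical names, not existing Nutch code, and it assumes that 
parameters such as "; charset=..." should be ignored when looking up an alias.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Nutch): maps content-type aliases
// to a single normalized content-type.
final class ContentTypeNormalizer {

    // alias -> normalized content-type, both stored in cleaned form
    private final Map<String, String> aliases = new HashMap<>();

    // Register an alias, e.g. application/powerpoint for
    // application/vnd.ms-powerpoint.
    void addAlias(String alias, String normalized) {
        aliases.put(clean(alias), clean(normalized));
    }

    // Return the normalized content-type, or the cleaned input
    // when no alias is registered for it.
    String normalize(String contentType) {
        String key = clean(contentType);
        return aliases.getOrDefault(key, key);
    }

    // Assumption: drop parameters after ';', trim, and lower-case.
    private static String clean(String contentType) {
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

With such a mapping, a plugin.xml would only need to declare 
application/vnd.ms-powerpoint, and the alias would be resolved before the 
parser lookup.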

Chris,
Thanks in advance for your future work. Could you please coordinate your 
efforts with Sébastien, since he seems very interested in contributing?

Andrzej,
The way to express a preference for one plugin over another, when both support 
the same content type, is to activate the plugin you want to handle that 
content-type and deactivate the other ones.
No?

Note: Since the MimeResolver handles associations between file-extensions and 
content-types, the path-suffix in plugin.xml (and in ParserFactory policy for 
choosing a Parser) could certainly be removed in order to have only one central 
point for storing this knowledge.

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>      Fix For: 0.8-dev
>
> The ParserFactory chooses the Parser plugin to use based on the content-types 
> and path-suffixes defined in the parsers' plugin.xml files.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType" 
> attribute matches the beginning of the content's type is used. 
> If none match, then the first whose "pathSuffix" attribute matches the end of 
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the 
> empty string is used.
> This policy causes problems when no match is found, because a random 
> parser is used (and there is a good chance this parser can't handle the 
> content).
> On the other hand, the content-type associated with a parser plugin is 
> specified in the plugin.xml of each plugin (this is the value used by the 
> ParserFactory), AND each parser also checks in its own code whether the 
> content-type is ok (it uses a hard-coded content-type value rather than 
> the value specified in the plugin.xml, so mismatches are possible between 
> the hard-coded content-type and the content-type declared in plugin.xml).
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
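
The selection policy quoted above can be sketched roughly as follows. This is 
not the actual ParserFactory code: ParserEntry and SelectionPolicySketch are 
hypothetical names, and the sketch assumes that an empty "contentType" or 
"pathSuffix" never counts as a match in the first two steps (the description 
does not say how empty attributes are treated there).

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for one parser declaration in a plugin.xml.
final class ParserEntry {
    final String id;
    final String contentType; // "contentType" attribute, may be empty
    final String pathSuffix;  // "pathSuffix" attribute, may be empty

    ParserEntry(String id, String contentType, String pathSuffix) {
        this.id = id;
        this.contentType = contentType;
        this.pathSuffix = pathSuffix;
    }
}

final class SelectionPolicySketch {

    // Returns the first entry selected by the three-step policy,
    // or empty when no entry qualifies at all.
    static Optional<ParserEntry> select(List<ParserEntry> plugins,
                                        String contentType,
                                        String urlPath) {
        // Step 1: content type has priority (prefix match).
        if (contentType != null) {
            for (ParserEntry p : plugins) {
                if (!p.contentType.isEmpty()
                        && contentType.startsWith(p.contentType)) {
                    return Optional.of(p);
                }
            }
        }
        // Step 2: otherwise match the end of the url's path.
        if (urlPath != null) {
            for (ParserEntry p : plugins) {
                if (!p.pathSuffix.isEmpty()
                        && urlPath.endsWith(p.pathSuffix)) {
                    return Optional.of(p);
                }
            }
        }
        // Step 3: otherwise fall back to the first entry whose
        // pathSuffix is the empty string.
        for (ParserEntry p : plugins) {
            if (p.pathSuffix.isEmpty()) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }
}
```

Given illustrative entries for text/plain (empty suffix), text/html and 
application/pdf, a PNG image at /logo.png matches neither step 1 nor step 2 
and ends up with the text/plain entry in step 3, which is the kind of wrong 
fallback the issue describes.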
