[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12323001 ]

Dawid Weiss commented on NUTCH-88:
----------------------------------

Hi.

I share your opinion -- this is an important issue. If I may add my two cents, 
the crawler should try to mimic a browser in handling MIME types. This, of 
course, gets quite complex since Internet Explorer has a very confusing and 
unnecessarily complex MIME type handling heuristic... which happens to change 
from version to version as well. Anyway, if you care to look, there are a few 
articles that explain the steps IE performs to resolve the MIME type of a Web 
page:

http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp
http://msdn.microsoft.com/workshop/networking/moniker/overview/mime_handling.asp

D.

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>      Fix For: 0.8-dev
>
> The ParserFactory chooses the Parser plugin to use based on the content types 
> and path suffixes defined in each parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType" 
> attribute matches the beginning of the content's type is used. 
> If none match, then the first whose "pathSuffix" attribute matches the end of 
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the 
> empty string is used.
> This policy causes problems when no match is found, because an essentially 
> arbitrary parser is used (and there is a good chance this parser can't handle 
> the content).
> On the other hand, the content type associated with a parser plugin is 
> specified in the plugin.xml of each plugin (this is the value used by the 
> ParserFactory), AND each parser's own code also checks whether the 
> content type is OK. That check uses a hard-coded content-type value rather 
> than the value specified in plugin.xml, so mismatches are possible between 
> the hard-coded content type and the one declared in plugin.xml.
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
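
The three-step selection policy quoted above can be sketched roughly as follows. 
This is only an illustration of the policy as the issue describes it, not the 
actual ParserFactory code: the ParserEntry and ParserSelectionSketch classes, 
the select method, and the choice to skip empty "contentType" values in step 1 
are all assumptions made for the example.

```java
import java.util.List;

/** Hypothetical stand-in for one parser declaration read from a plugin.xml. */
final class ParserEntry {
    final String id;          // illustrative plugin identifier
    final String contentType; // "contentType" attribute, may be null
    final String pathSuffix;  // "pathSuffix" attribute, may be null; "" marks a fallback

    ParserEntry(String id, String contentType, String pathSuffix) {
        this.id = id;
        this.contentType = contentType;
        this.pathSuffix = pathSuffix;
    }
}

/** Sketch of the selection policy described in the issue (not Nutch's real code). */
final class ParserSelectionSketch {

    /** Returns the first matching entry, or null if no step matches. */
    static ParserEntry select(List<ParserEntry> entries, String contentType, String urlPath) {
        // Step 1: content type has priority. The declared contentType must match
        // the beginning of the content's type (assumption: empty values are skipped).
        if (contentType != null) {
            for (ParserEntry e : entries) {
                if (e.contentType != null && !e.contentType.isEmpty()
                        && contentType.startsWith(e.contentType)) {
                    return e;
                }
            }
        }
        // Step 2: otherwise, the first entry whose pathSuffix matches the end of the URL's path.
        if (urlPath != null) {
            for (ParserEntry e : entries) {
                if (e.pathSuffix != null && !e.pathSuffix.isEmpty()
                        && urlPath.endsWith(e.pathSuffix)) {
                    return e;
                }
            }
        }
        // Step 3: otherwise, the first entry whose pathSuffix is the empty string,
        // whether or not that parser can actually handle the content.
        for (ParserEntry e : entries) {
            if ("".equals(e.pathSuffix)) {
                return e;
            }
        }
        return null;
    }
}
```

With three made-up entries, ("parse-html", "text/html", "html"), 
("parse-pdf", "application/pdf", "pdf") and ("parse-text", "text/plain", ""), 
a document of type "image/png" at path "/logo.png" falls through steps 1 and 2 
and is handed to "parse-text", which is the kind of mismatch the issue 
complains about.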




_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
