Enhance ParserFactory plugin selection policy
---------------------------------------------
Key: NUTCH-88
URL: http://issues.apache.org/jira/browse/NUTCH-88
Project: Nutch
Type: Improvement
Components: indexer
Versions: 0.7, 0.8-dev
Reporter: Jerome Charron
Fix For: 0.8-dev
The ParserFactory chooses the Parser plugin to use based on the content types
and path suffixes declared in each parser's plugin.xml file.
The selection policy is as follows:
1. Content type has priority: the first plugin found whose "contentType"
   attribute matches the beginning of the content's type is used.
2. If none match, then the first plugin whose "pathSuffix" attribute matches
   the end of the URL's path is used.
3. If neither of these match, then the first plugin whose "pathSuffix" is the
   empty string is used.
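The three rules above can be sketched roughly as follows. This is only an illustration of the policy as described, not the actual ParserFactory code: the ParserEntry class, the select method, and the sample plugin ids and attribute values are all hypothetical, and the sketch assumes that an empty "contentType" attribute means "no content type declared".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the selection policy; not the real ParserFactory. */
class ParserSelectionSketch {

    /** Hypothetical stand-in for one parser declaration in a plugin.xml file. */
    static final class ParserEntry {
        final String id;
        final String contentType; // value of the "contentType" attribute
        final String pathSuffix;  // value of the "pathSuffix" attribute

        ParserEntry(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    /** Returns the id of the selected parser, or null if no rule applies. */
    static String select(List<ParserEntry> entries, String contentType, String urlPath) {
        // Rule 1: the declared type must match the beginning of the content's type.
        // Assumption: an empty declared type is treated as "not declared".
        for (ParserEntry e : entries) {
            if (!e.contentType.isEmpty() && contentType.startsWith(e.contentType)) {
                return e.id;
            }
        }
        // Rule 2: the declared suffix must match the end of the URL's path.
        for (ParserEntry e : entries) {
            if (!e.pathSuffix.isEmpty() && urlPath.endsWith(e.pathSuffix)) {
                return e.id;
            }
        }
        // Rule 3: fall back to the first parser declared with an empty suffix.
        for (ParserEntry e : entries) {
            if (e.pathSuffix.isEmpty()) {
                return e.id;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<ParserEntry> entries = new ArrayList<>();
        entries.add(new ParserEntry("parse-html", "text/html", "html"));
        entries.add(new ParserEntry("parse-pdf", "application/pdf", "pdf"));
        entries.add(new ParserEntry("parse-text", "text/plain", ""));

        // Matched by content type (rule 1).
        System.out.println(select(entries, "text/html; charset=UTF-8", "/index.html"));
        // Matched by path suffix (rule 2).
        System.out.println(select(entries, "application/octet-stream", "/docs/report.pdf"));
        // Nothing matches, so the empty-suffix parser is used (rule 3).
        System.out.println(select(entries, "image/png", "/logo.png"));
    }
}
```

In the last call of this sketch, a PNG image ends up with the plain-text parser simply because that parser declares an empty suffix.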
This policy breaks down when no match is found, because an essentially
arbitrary parser is used (and there is a good chance that this parser cannot
handle the content).
In addition, the content type associated with a parser plugin is specified in
the plugin.xml of each plugin (this is the value used by the ParserFactory),
AND the code of each parser checks for itself whether the content type is
acceptable. That check uses a hard-coded content-type value rather than the
value specified in plugin.xml, so the hard-coded content type and the content
type declared in plugin.xml can end up mismatched.
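To make the mismatch concrete, here is a minimal sketch. Everything in it is hypothetical: the HardCodedParser class, the factoryMatches helper, and the two content-type strings are made up for illustration and do not come from any real Nutch plugin.

```java
/** Hypothetical illustration of a declared vs. hard-coded content-type mismatch. */
class ContentTypeMismatchSketch {

    /** Hypothetical parser whose accepted type is hard-coded in its own code. */
    static final class HardCodedParser {
        // Made-up value, deliberately different from the declared one below.
        private static final String ACCEPTED_TYPE = "application/x-example";

        /** Returns true if the parser agrees to handle the content. */
        boolean accepts(String contentType) {
            return contentType.startsWith(ACCEPTED_TYPE);
        }
    }

    /** Mimics the factory side, which only sees the type declared in plugin.xml. */
    static boolean factoryMatches(String declaredType, String contentType) {
        return contentType.startsWith(declaredType);
    }

    public static void main(String[] args) {
        String declaredInPluginXml = "application/example"; // made-up declared value
        String contentType = "application/example";

        // The factory selects the parser based on the declared type...
        System.out.println("selected by factory: "
                + factoryMatches(declaredInPluginXml, contentType)); // true
        // ...but the parser's own hard-coded check rejects the same content.
        System.out.println("accepted by parser:  "
                + new HardCodedParser().accepts(contentType)); // false
    }
}
```

The two checks disagree only because the same piece of information is kept in two places.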
A complete list of the problems and a discussion of this point are available at:
* http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
* http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html