[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332609 ]
Doug Cutting commented on NUTCH-88:
-----------------------------------

Jerome,

This works well now. I've merged your changes to the mapred branch.

Thanks!

Doug

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>     Assignee: Jerome Charron
>      Fix For: 0.8-dev
>
> The ParserFactory chooses the Parser plugin to use based on the content types
> and path suffixes defined in each parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType"
> attribute matches the beginning of the content's type is used.
> If none match, then the first whose "pathSuffix" attribute matches the end of
> the URL's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the
> empty string is used.
> This policy causes a lot of problems when no match is found, because a random
> parser is used (and there is a good chance this parser can't handle the
> content).
> On the other hand, the content type associated with a parser plugin is
> specified in the plugin.xml of each plugin (this is the value used by the
> ParserFactory), AND each parser also checks in its own code whether the
> content type is acceptable. That check uses a hard-coded content-type value
> rather than the value specified in plugin.xml, so the hard-coded content type
> and the one declared in plugin.xml can mismatch.
> A complete list of problems and a discussion about this point are available at:
> * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
> * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
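
For readers following the issue description, the three-step selection policy
it criticizes can be sketched roughly as below. This is only an illustration,
not the actual ParserFactory code: the class ParserSelectionSketch, the
ParserDescriptor type, the select() method and the attribute values given to
each plugin are all hypothetical. It also assumes that an empty "contentType"
means the plugin declares none, which the description does not state.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the policy described in NUTCH-88; not Nutch's API.
class ParserSelectionSketch {

    /** Stand-in for the attributes a parser plugin declares in plugin.xml. */
    static final class ParserDescriptor {
        final String id;
        final String contentType;
        final String pathSuffix;

        ParserDescriptor(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    /** Returns the id of the selected parser, or null if nothing matches. */
    static String select(List<ParserDescriptor> parsers,
                         String contentType, String urlPath) {
        // Step 1: content type has priority. The declared value must match
        // the beginning of the content's type.
        // Assumption: an empty declared contentType never matches.
        for (ParserDescriptor p : parsers) {
            if (!p.contentType.isEmpty() && contentType.startsWith(p.contentType)) {
                return p.id;
            }
        }
        // Step 2: otherwise, the declared pathSuffix must match the end of
        // the URL's path.
        for (ParserDescriptor p : parsers) {
            if (!p.pathSuffix.isEmpty() && urlPath.endsWith(p.pathSuffix)) {
                return p.id;
            }
        }
        // Step 3: otherwise, fall back to the first plugin whose pathSuffix
        // is the empty string, whether or not it can handle the content.
        for (ParserDescriptor p : parsers) {
            if (p.pathSuffix.isEmpty()) {
                return p.id;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative declarations only; real plugin.xml values may differ.
        List<ParserDescriptor> parsers = Arrays.asList(
            new ParserDescriptor("parse-html", "text/html", "html"),
            new ParserDescriptor("parse-pdf", "application/pdf", "pdf"),
            new ParserDescriptor("parse-text", "text/plain", ""));

        System.out.println(select(parsers, "text/html; charset=utf-8", "/index.php"));
        System.out.println(select(parsers, "application/octet-stream", "/doc.pdf"));
        System.out.println(select(parsers, "image/png", "/logo.png"));
    }
}
```

In this sketch the last call falls through to step 3 and hands a PNG image to
the plain-text parser, which is the kind of mismatch the issue describes.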
