[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12322997 ]
Chris A. Mattmann commented on NUTCH-88:
----------------------------------------

I'm currently working on writing a proposal for addressing this issue. The
proposal will include the following information:

* summary of issue
* suggested remedy
* architectural impact
* impact on current releases of Nutch
  - incompatibilities
  - any other issues
* available resources
* timeframe

I hope to have it done by tomorrow afternoon, say 3pm Pacific Standard Time.

Thanks,
  Chris

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>      Fix For: 0.8-dev
>
> The ParserFactory chooses the Parser plugin to use based on the content types
> and path suffixes defined in each parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType"
> attribute matches the beginning of the content's type is used.
> If none match, then the first whose "pathSuffix" attribute matches the end of
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the
> empty string is used.
> This policy has a lot of problems when no match is found, because a random
> parser is used (and there is a good chance that this parser can't handle the
> content).
> On the other hand, the content-type associated with a parser plugin is
> specified in the plugin.xml of each plugin (this is the value used by the
> ParserFactory), AND the code of each parser itself checks whether the
> content-type is ok (it uses a hard-coded content-type value rather than the
> value specified in the plugin.xml => possible mismatches between the
> hard-coded content-type and the content-type declared in plugin.xml).
> A complete list of problems and discussion about this point is available in:
> * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
> * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
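
For reference, the three-step selection policy quoted in the issue description could be sketched roughly as below. This is only an illustration of the policy as worded in the issue, not the actual ParserFactory code: the class ParserSelectionSketch, the PluginDecl type, the select() and samplePlugins() methods, and the attribute values given to the sample plugins are all hypothetical. The sketch also assumes that an empty "contentType" or "pathSuffix" attribute is skipped in the first two steps, which the issue text does not state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class ParserSelectionSketch {

    /** Hypothetical stand-in for the attributes a parser plugin declares in plugin.xml. */
    public static final class PluginDecl {
        public final String id;
        public final String contentType;
        public final String pathSuffix;

        public PluginDecl(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    /** Applies the three steps of the policy, in order, over the plugins as listed. */
    public static Optional<PluginDecl> select(List<PluginDecl> plugins,
                                              String contentType, String urlPath) {
        // Step 1: content type has priority. Take the first plugin whose declared
        // contentType matches the beginning of the content's type.
        // (Assumption: empty declarations are skipped here.)
        if (contentType != null) {
            for (PluginDecl p : plugins) {
                if (!p.contentType.isEmpty() && contentType.startsWith(p.contentType)) {
                    return Optional.of(p);
                }
            }
        }
        // Step 2: otherwise take the first plugin whose pathSuffix matches the end
        // of the URL's path. (Assumption: empty suffixes are left to step 3.)
        if (urlPath != null) {
            for (PluginDecl p : plugins) {
                if (!p.pathSuffix.isEmpty() && urlPath.endsWith(p.pathSuffix)) {
                    return Optional.of(p);
                }
            }
        }
        // Step 3: otherwise fall back to the first plugin with an empty pathSuffix,
        // whether or not it can actually handle the content.
        for (PluginDecl p : plugins) {
            if (p.pathSuffix.isEmpty()) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }

    /** Illustrative plugin declarations; the attribute values are made up for this sketch. */
    public static List<PluginDecl> samplePlugins() {
        return Arrays.asList(
            new PluginDecl("parse-html", "text/html", ""),
            new PluginDecl("parse-pdf", "application/pdf", "pdf"),
            new PluginDecl("parse-text", "text/plain", "txt"));
    }

    public static void main(String[] args) {
        List<PluginDecl> plugins = samplePlugins();
        // Selected by content type (step 1).
        System.out.println(select(plugins, "text/html; charset=UTF-8", "/index.html").get().id);
        // Selected by path suffix (step 2).
        System.out.println(select(plugins, "application/octet-stream", "/docs/manual.pdf").get().id);
        // Selected by the empty-suffix fallback (step 3).
        System.out.println(select(plugins, "image/png", "/logo.png").get().id);
    }
}
```

With these sample declarations, the third call in main() falls through to step 3 and hands a PNG image to the plugin that happens to declare an empty path suffix. That is the kind of mismatch the issue complains about when nothing matches.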