[ http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332568 ]
Jerome Charron commented on NUTCH-88:
-------------------------------------

Corrections are committed (http://svn.apache.org/viewcvs.cgi?rev=326889&view=rev).
Sorry for the delay, but I am doing my best... (thanks, Chris, for offering your help)

Implementation Note:
In this implementation, the MimeType.clean(String) method constructs a new
MimeType object each time it is called (the MimeType constructor cleans the
content-type). It was the quickest way to solve this issue, but it is not
optimal code. It would be better for performance (it would avoid instantiating
very short-lived objects) if:
1. The clean method really contained the cleaning code.
2. The MimeType constructors used the clean method.

Regards

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>     Assignee: Jerome Charron
>      Fix For: 0.8-dev
>
> The ParserFactory chooses the Parser plugin to use based on the content-types
> and path-suffix defined in each parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType"
> attribute matches the beginning of the content's type is used.
> If none match, then the first whose "pathSuffix" attribute matches the end of
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the
> empty string is used.
> This policy has a lot of problems when no match is found, because a random
> parser is used (and there is a good chance this parser can't handle the
> content).
> On the other hand, the content-type associated with a parser plugin is
> specified in the plugin.xml of each plugin (this is the value used by the
> ParserFactory), AND each parser also checks in its own code whether the
> content-type is ok (it uses a hard-coded content-type value rather than
> the value specified in the plugin.xml => possibility of mismatches between
> the hard-coded content-type and the content-type declared in plugin.xml).
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
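
For illustration, the refactoring suggested in the implementation note above could look roughly like the sketch below. SimpleMimeType is a hypothetical stand-in, not the actual Nutch MimeType class, and the cleaning rule shown (drop parameters after ';', trim, lower-case) is only an assumption about what "cleaning" means. The point is the shape: clean(String) does the work without allocating an object, and the constructor calls it.

```java
import java.util.Locale;

/** Hypothetical stand-in for a MIME type class; NOT the actual Nutch MimeType. */
final class SimpleMimeType {

    private final String name;

    /** The constructor delegates to clean(), so the cleaning logic lives in one place. */
    SimpleMimeType(String rawContentType) {
        String cleaned = clean(rawContentType);
        if (cleaned == null) {
            throw new IllegalArgumentException("Unusable content type: " + rawContentType);
        }
        this.name = cleaned;
    }

    /**
     * Cleans a raw content type without creating a SimpleMimeType instance.
     * Assumed rule (illustrative only): drop any parameters after ';',
     * trim whitespace and lower-case the result. Returns null if nothing is left.
     */
    static String clean(String rawContentType) {
        if (rawContentType == null) {
            return null;
        }
        int semicolon = rawContentType.indexOf(';');
        String base = (semicolon >= 0)
                ? rawContentType.substring(0, semicolon)
                : rawContentType;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    String getName() {
        return name;
    }
}
```

With this layout, callers that only need the cleaned string call SimpleMimeType.clean(...) directly and no short-lived object is created.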

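
The selection policy quoted in the issue description can be sketched as follows, mainly to show where the fallback goes wrong. ParserDescriptor and ParserSelectionSketch are hypothetical names, not the real ParserFactory code, and the attribute values in the usage note are made up. One assumption is labeled in the code: an empty attribute is ignored in steps 1 and 2.

```java
import java.util.List;

/** Hypothetical view of the attributes a parser plugin declares in its plugin.xml. */
final class ParserDescriptor {
    final String id;
    final String contentType; // may be null if not declared
    final String pathSuffix;  // "" marks the catch-all plugin

    ParserDescriptor(String id, String contentType, String pathSuffix) {
        this.id = id;
        this.contentType = contentType;
        this.pathSuffix = pathSuffix;
    }
}

/** Sketch of the three-step policy described in the issue; not the Nutch implementation. */
final class ParserSelectionSketch {

    static ParserDescriptor select(List<ParserDescriptor> plugins,
                                   String contentType, String urlPath) {
        // Step 1: content type has priority. The first plugin whose declared
        // contentType matches the beginning of the content's type wins.
        // Assumption: an empty or missing attribute never matches here.
        if (contentType != null) {
            for (ParserDescriptor p : plugins) {
                if (p.contentType != null && !p.contentType.isEmpty()
                        && contentType.startsWith(p.contentType)) {
                    return p;
                }
            }
        }
        // Step 2: otherwise, the first plugin whose pathSuffix matches the
        // end of the URL's path (same assumption about empty attributes).
        if (urlPath != null) {
            for (ParserDescriptor p : plugins) {
                if (p.pathSuffix != null && !p.pathSuffix.isEmpty()
                        && urlPath.endsWith(p.pathSuffix)) {
                    return p;
                }
            }
        }
        // Step 3: otherwise, the first plugin whose pathSuffix is the empty
        // string, whether or not it can handle the content.
        for (ParserDescriptor p : plugins) {
            if ("".equals(p.pathSuffix)) {
                return p;
            }
        }
        return null; // no plugin declared a catch-all suffix
    }
}
```

For example, with descriptors ("parse-html", "text/html", "html"), ("parse-pdf", "application/pdf", "pdf") and ("parse-text", "text/plain", ""), content of type image/png at /logo.png falls through to step 3 and is handed to the text parser, which is the kind of mismatch the issue describes.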