[ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_63147 ]

Andrzej Bialecki commented on NUTCH-34:
----------------------------------------
Stephan,

Regarding the urlfilter config file: well, my point was that it would be nice to have a single place to turn on/off various plugins. The alternative is to do it for each plugin separately... We could change the format though - instead of extensions we could perhaps use plugin IDs...? This file could also define the ordering (or priority) of plugins.

Regarding the plugin ordering: parser plugins are somewhat exceptional, because only one of them has to be invoked. Other plugins are used as a filtering chain - but even in those cases their order matters. For the parsing plugins the algorithm currently works as follows (copied from ParserFactory):

[Parser extensions should define the attributes "contentType" and/or "pathSuffix". Content type has priority: the first plugin found whose "contentType" attribute matches the beginning of the content's type is used. If none match, then the first whose "pathSuffix" attribute matches the end of the url's path is used. If neither of these match, then the first plugin whose "pathSuffix" is the empty string is used.]

This means that if there is more than one parser for the same content type and path suffix, only the first one on the list will ever be used.

> Parsing different content formats
> ---------------------------------
>
>          Key: NUTCH-34
>          URL: http://issues.apache.org/jira/browse/NUTCH-34
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stephan Strittmatter
>     Priority: Trivial
>
> At the moment Nutch is set up to filter content via the XML config file.
> The number of bytes to load is also set globally there.
> I think it would be better to let the parser plugins "register" themselves in
> some registry where every plugin could tell the fetcher that:
> 1. this document type is wanted (because the parser plugin is
>    installed and activated)
> 2. how much of the content is required (some plugins need the whole
>    content and some do not)

--
This message is automatically generated by JIRA.
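To illustrate the selection rule quoted from ParserFactory in the comment above, here is a minimal Java sketch. It is not the actual Nutch code: the class names ParserExtension and ParserSelectionSketch, the "id" field, and the select() method are hypothetical and exist only for this example. The sketch also assumes that an empty "contentType" never matches in step 1 and that an empty "pathSuffix" is reserved for the step 3 fallback, which the quoted text does not spell out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for the attributes a parser extension declares. */
final class ParserExtension {
    final String id;           // illustrative plugin ID, not a Nutch field
    final String contentType;  // null if the extension does not declare one
    final String pathSuffix;   // null if not declared; "" marks a fallback parser

    ParserExtension(String id, String contentType, String pathSuffix) {
        this.id = id;
        this.contentType = contentType;
        this.pathSuffix = pathSuffix;
    }
}

final class ParserSelectionSketch {

    /** Returns the first extension matching the three-step rule, if any. */
    static Optional<ParserExtension> select(List<ParserExtension> extensions,
                                            String contentType, String urlPath) {
        // Step 1: content type has priority; match on the beginning of the type.
        // Assumption: an empty declared content type is ignored here.
        for (ParserExtension e : extensions) {
            if (e.contentType != null && !e.contentType.isEmpty()
                    && contentType != null && contentType.startsWith(e.contentType)) {
                return Optional.of(e);
            }
        }
        // Step 2: otherwise match on the end of the URL path.
        // Assumption: the empty suffix is skipped here and handled in step 3.
        for (ParserExtension e : extensions) {
            if (e.pathSuffix != null && !e.pathSuffix.isEmpty()
                    && urlPath != null && urlPath.endsWith(e.pathSuffix)) {
                return Optional.of(e);
            }
        }
        // Step 3: otherwise fall back to the first extension with an empty suffix.
        for (ParserExtension e : extensions) {
            if ("".equals(e.pathSuffix)) {
                return Optional.of(e);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<ParserExtension> extensions = Arrays.asList(
                new ParserExtension("html-a", "text/html", "html"),
                new ParserExtension("html-b", "text/html", "html"),
                new ParserExtension("pdf", "application/pdf", "pdf"),
                new ParserExtension("fallback", null, ""));

        // Two parsers claim text/html; list order decides, so html-b is never chosen.
        System.out.println(
                select(extensions, "text/html; charset=UTF-8", "/index.html").get().id); // html-a
        // No content type match, so the path suffix decides.
        System.out.println(
                select(extensions, "application/octet-stream", "/doc.pdf").get().id);    // pdf
        // Neither matches, so the empty-suffix parser is used.
        System.out.println(
                select(extensions, "image/png", "/logo.png").get().id);                  // fallback
    }
}
```

In this sketch "html-b" can never be returned, because "html-a" precedes it in the list and declares the same attributes. That is the situation in which an explicit ordering or priority setting in a shared config file would make a difference.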
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers