[ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_63049 ]

Andrzej Bialecki commented on NUTCH-34:
----------------------------------------
Stephan & Jerome,

Let me explain why I think a boolean is useful (though, strictly speaking, not required, as you noticed).

When the Fetcher gets a content header (be it HTTP or another protocol's header), it learns the content type (if present) and the content size. Based on the content type it can select a parse plugin. Now, if the content size exceeds the maximum size set in the plugin, the Fetcher currently has only one choice: fetch up to the maximum number of bytes, pass this partial content to the plugin and pray that it works. However, if we introduce a boolean property with the meaning "plugin can handle partial content", then the Fetcher can make an informed decision whether to fetch the partial content at all. As a result, we can gain significant bandwidth, disk space and CPU savings. Also, this type of information is very easy to provide...

Setting the maximum size to "0" has different semantics: it simply means that the Fetcher should fetch all content, no matter its size.

Regarding the plugin registry: IMHO it needs a configuration file anyway. There needs to be a mechanism in place to preserve the ordering and priority of active plugins (more sophisticated than the current, nearly random way), especially if more than one plugin handles the same mime type. I agree that it's convenient if each plugin "registers itself" for handling given mime types, but I would add to that "with a certain priority if more than one plugin exists for a given type". Again, IMHO, it is also convenient to have a single place to quickly turn various plugins on and off. It could be a config file, or it could be an API (perhaps both?).

> Parsing different content formats
> ---------------------------------
>
>          Key: NUTCH-34
>          URL: http://issues.apache.org/jira/browse/NUTCH-34
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stephan Strittmatter
>     Priority: Trivial
>
> At the moment Nutch is set up to filter content via the XML config file.
> The number of bytes to load is also set globally there.
> I think it would be better to let the parser plugins "register" themselves in
> some registry where every plugin could tell the fetcher:
> 1. that this document type is wanted (because the parser plugin is
>    installed and activated)
> 2. how much of the content is required (some plugins need the whole
>    content and some do not)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
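
For illustration, the fetch decision and the priority-aware registry discussed in the comment could be sketched roughly as follows. This is only a sketch under stated assumptions, not Nutch code: the class PartialFetchSketch, the ParserInfo fields, the Decision enum, the plugin names and the rule that a lower priority number wins are all hypothetical. It also assumes the protocol header always reports a content size.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only: none of these types or fields exist in Nutch.
final class PartialFetchSketch {

    enum Decision { FETCH_ALL, FETCH_TRUNCATED, SKIP }

    /** What a parse plugin might declare about itself (all names are assumptions). */
    static final class ParserInfo {
        final String name;
        final String mimeType;
        final int priority;           // assumption: lower value wins
        final long maxContentBytes;   // 0 means "fetch everything, whatever the size"
        final boolean handlesPartial; // the proposed "can handle partial content" flag

        ParserInfo(String name, String mimeType, int priority,
                   long maxContentBytes, boolean handlesPartial) {
            this.name = name;
            this.mimeType = mimeType;
            this.priority = priority;
            this.maxContentBytes = maxContentBytes;
            this.handlesPartial = handlesPartial;
        }
    }

    private final List<ParserInfo> parsers = new ArrayList<>();

    /** A plugin "registers itself" for one mime type with a priority. */
    void register(ParserInfo info) {
        parsers.add(info);
    }

    /** Picks the registered parser with the best priority for a mime type. */
    Optional<ParserInfo> select(String mimeType) {
        return parsers.stream()
                .filter(p -> p.mimeType.equals(mimeType))
                .min(Comparator.comparingInt((ParserInfo p) -> p.priority));
    }

    /** Decides what to fetch once the header has reported the content size. */
    static Decision decide(ParserInfo parser, long declaredLength) {
        if (parser.maxContentBytes == 0) {
            return Decision.FETCH_ALL;       // no limit configured
        }
        if (declaredLength <= parser.maxContentBytes) {
            return Decision.FETCH_ALL;       // content fits within the limit
        }
        // Too large: truncate only if the plugin can cope with partial content.
        return parser.handlesPartial ? Decision.FETCH_TRUNCATED : Decision.SKIP;
    }

    public static void main(String[] args) {
        PartialFetchSketch registry = new PartialFetchSketch();
        registry.register(new ParserInfo("pdf-strict", "application/pdf", 2, 65536L, false));
        registry.register(new ParserInfo("pdf-lenient", "application/pdf", 1, 65536L, true));
        registry.register(new ParserInfo("text", "text/plain", 1, 0L, true));

        ParserInfo pdf = registry.select("application/pdf")
                .orElseThrow(IllegalStateException::new);
        // prints "pdf-lenient: FETCH_TRUNCATED"
        System.out.println(pdf.name + ": " + decide(pdf, 1_000_000L));
    }
}
```

With the flag set to false, the same oversized document would be skipped rather than fetched and handed to a parser that cannot use it, which is where the bandwidth, disk and CPU savings would come from.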
