[ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_63145 ]

Stephan Strittmatter commented on NUTCH-34:
-------------------------------------------
Andrzej,

about the plugin registry: I agree with you that it should be possible to order the plugins in some way and to activate or deactivate them via config or API. (With API calls, Nutch could then support something like "self-healing" if a specific parser fails too often, but that is another story...)

But I think the restrictions on which content should be fetched and which should not, currently defined by excludes like:

<code>
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
</code>

in crawl-urlfilter.txt, should be managed by the parser plugins that are currently active.

BTW, what happens if more than one parser plugin feels responsible for a specific content type? I haven't tried that. Would this make sense, for example as a fallback if the first parser fails?

> Parsing different content formats
> ---------------------------------
>
>          Key: NUTCH-34
>          URL: http://issues.apache.org/jira/browse/NUTCH-34
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stephan Strittmatter
>     Priority: Trivial
>
> At the moment Nutch is set up to filter content via the XML config file.
> The number of bytes to load is also set globally there.
> I think it would be better to let the parser plugins "register" themselves in
> some registry where every plugin could tell the fetcher that:
> 1. this document type is wanted (because the parser plugin is
>    installed and activated)
> 2. how much of the content is required (some plugins need the whole
>    content and some do not)
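
To make the registry idea from the issue description more concrete, here is a minimal sketch. All names in it (ParserPlugin, Registry, isWanted, bytesToFetch, the plugin helper) are hypothetical and are not Nutch's actual plugin API. It only illustrates the two points above (which content types are wanted, and how many bytes are needed), plus the fallback question of trying the next parser when the first one fails.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParserRegistrySketch {

    /** Hypothetical plugin contract; NOT Nutch's real parser interface. */
    interface ParserPlugin {
        String contentType();   // content type this plugin can handle
        int requiredBytes();    // bytes needed; -1 means "whole content"
        String parse(byte[] content) throws Exception;
    }

    /** Hypothetical registry the fetcher could consult before downloading. */
    static class Registry {
        // Plugins per content type, kept in registration order.
        private final Map<String, List<ParserPlugin>> byType = new LinkedHashMap<>();

        void register(ParserPlugin plugin) {
            byType.computeIfAbsent(plugin.contentType(), k -> new ArrayList<>()).add(plugin);
        }

        private List<ParserPlugin> pluginsFor(String contentType) {
            return byType.getOrDefault(contentType, Collections.<ParserPlugin>emptyList());
        }

        /** Point 1: a type is wanted only if some active plugin registered for it. */
        boolean isWanted(String contentType) {
            return !pluginsFor(contentType).isEmpty();
        }

        /** Point 2: fetch enough for the most demanding plugin; -1 means everything. */
        int bytesToFetch(String contentType) {
            int max = 0;
            for (ParserPlugin p : pluginsFor(contentType)) {
                if (p.requiredBytes() < 0) {
                    return -1;
                }
                max = Math.max(max, p.requiredBytes());
            }
            return max;
        }

        /** Fallback: try plugins in order; return null if none succeeds. */
        String parse(String contentType, byte[] content) {
            for (ParserPlugin p : pluginsFor(contentType)) {
                try {
                    return p.parse(content);
                } catch (Exception e) {
                    // This parser failed; fall through to the next one.
                }
            }
            return null;
        }
    }

    /** Test helper: a plugin that returns 'result', or fails if 'result' is null. */
    static ParserPlugin plugin(final String type, final int bytes, final String result) {
        return new ParserPlugin() {
            public String contentType() { return type; }
            public int requiredBytes() { return bytes; }
            public String parse(byte[] content) throws Exception {
                if (result == null) {
                    throw new Exception("parse failed");
                }
                return result;
            }
        };
    }
}
```

With something like this, the suffix excludes in crawl-urlfilter.txt would no longer have to be maintained by hand: the fetcher would skip any content type for which isWanted returns false. Whether registration order is the right fallback order is an open design choice.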
