[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ]
Matt Kangas commented on NUTCH-272:
-----------------------------------

I've been thinking about this after hitting several sites that explode into 1.5M URLs (or more). I could sleep easier at night if I could set a cap at 50k URLs/site and just check my log files in the morning.

Counting total URLs/domain needs to happen in one of the places where Nutch already traverses the crawldb. For Nutch 0.8 this is "nutch generate" and "nutch updatedb".

URLs are added by both "nutch inject" and "nutch updatedb". These tools use the URLFilter plugin extension point to determine which URLs to keep and which to reject. But note that "updatedb" can only compute URLs/domain _after_ traversing the crawldb, during which time it merges in the new URLs.

So, one way to approach it is:

* Count URLs/domain during "update". If a domain exceeds the limit, write it to a file.
* Read this file at the start of "update" (next cycle) and block further additions
* Or: read the file in a new URLFilter plugin, and block the URLs in URLFilter.filter()

If you do it all in "update", you won't catch URLs added via "inject", but it would still halt runaway crawls, and it would be simpler because it would be a one-file patch.

> Max. pages to crawl/fetch per site (emergency limit)
> ----------------------------------------------------
>
>          Key: NUTCH-272
>          URL: http://issues.apache.org/jira/browse/NUTCH-272
>      Project: Nutch
>         Type: Improvement
>     Reporter: Stefan Neufeind
>
> If I'm right, there is no way in place right now for setting an "emergency
> limit" to fetch a certain max. number of pages per site. Is there an "easy"
> way to implement such a limit, maybe as a plugin?
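To make the count-then-block idea from the comment concrete, here is a minimal sketch in plain Java. It is not Nutch code and not a patch: the class DomainCapSketch and its methods are hypothetical names, the crawldb is stood in for by a plain list of URL strings, hosts are used as a simplification of "domain", and the file that would carry the over-limit list between cycles is replaced by an in-memory Set. The filter method only assumes the convention of returning the URL to keep it and null to reject it.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; none of these names exist in Nutch. */
final class DomainCapSketch {

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /**
     * Step 1 ("update"): count URLs per host in one pass and return the
     * hosts whose count exceeds the limit. In the proposal this set
     * would be written to a file for the next cycle.
     */
    static Set<String> hostsOverLimit(Iterable<String> crawlDbUrls, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : crawlDbUrls) {
            String host = hostOf(url);
            if (host != null) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        Set<String> blocked = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > limit) {
                blocked.add(e.getKey());
            }
        }
        return blocked;
    }

    /**
     * Step 2 (next cycle): reject new URLs from blocked hosts.
     * Returns the URL unchanged to keep it, or null to reject it.
     */
    static String filter(String url, Set<String> blockedHosts) {
        String host = hostOf(url);
        return (host != null && blockedHosts.contains(host)) ? null : url;
    }
}
```

Because the blocked set is only computed after a full pass, a host can overshoot the limit within the cycle in which it first crosses it; the cap takes effect from the next cycle, which matches the "emergency limit" intent rather than an exact quota.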
