Nutch - Focused crawling
Hi,

We've been using Nutch for focused crawling (right now we are crawling about 50 domains) and have run into the long-tail problem. We've set TopN to 100,000 and generate.max.per.host to about 1500. 90% of all domains finish fetching after 30 minutes, while the remaining 10% take an additional 2.5 hours, making the slowest domain the bottleneck of the entire fetch process.

I've read Ken Krugler's document, and he describes the same problem: http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/

I'm wondering: does anyone have a suggestion on the best way to tackle this issue? I think Ken suggested limiting the fetch time, for example terminating after 1 hour even if you are not done yet. Is that feature available in Nutch? I'll be happy to try and contribute code if required!

Thanks,
Eran
Re: Nutch - Focused crawling
Hi Eran,

There is currently no time limit implemented in the Fetcher. We implemented one which worked quite well in combination with another mechanism that clears the URLs from a pool if more than x successive exceptions have been encountered. This limits the damage when a site or domain is not responsive.

I might try to submit a patch if I find the time next week. Our code has been heavily modified by previous patches which have not been committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658), so I'd need to spend a bit of time extracting this specific functionality from the rest.

Best,
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/21 Eran Zinman zze...@gmail.com
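The time-limit idea Julien describes can be sketched in isolation. This is not the actual patch; the method names, the minutes-based limit, and the `deadline`/`hitTimeLimit` helpers are all assumptions for illustration. The idea is simply to compute an absolute deadline when the fetch starts and have the fetcher threads stop pulling new items from their queues once it has passed:

```java
public class TimeLimitSketch {
    /**
     * Returns the absolute deadline in milliseconds, or -1 when no limit
     * is configured (a negative limit means "run until done").
     */
    public static long deadline(long startTimeMs, int timelimitMins) {
        return timelimitMins > 0 ? startTimeMs + timelimitMins * 60L * 1000L : -1L;
    }

    /**
     * True when the fetcher should stop taking new items from its queues.
     * Items already in flight would still be allowed to finish.
     */
    public static boolean hitTimeLimit(long deadline, long nowMs) {
        return deadline > 0 && nowMs >= deadline;
    }
}
```

Unfetched URLs left in the queues at the deadline would simply stay in the CrawlDb and be regenerated in a later segment, so nothing is lost except the time spent waiting on the slowest hosts.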
Re: Nutch upgrade to Hadoop
I have created NUTCH-768. I am in the middle of testing a few-thousand-page crawl against the most recently released version of Hadoop, 0.20.1. Everything passes unit tests fine and there are no interface breaks. Looks like it will be an easy upgrade so far :)

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> I would like to get a couple things in this release as well. Let me know if you want help with the upgrade.
>
> You mean you want to do the Hadoop upgrade? I won't stand in your way :)
Re: Nutch upgrade to Hadoop
Dennis Kubes wrote:
> I have created NUTCH-768. I am in the middle of testing a few thousand page crawl for the most recent released version of Hadoop 0.20.1. Everything passes unit tests fine and there are no interface breaks. Looks like it will be an easy upgrade so far :)

Great, thanks!

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Nutch upgrade to Hadoop
Hadoop seems to have a few more configuration files now, and there are some questions about which ones to move over. I think it might also be time to upgrade Xerces from 2.6 to the current 2.9.1; I am testing with that currently. I know that with the current version there are some (ignorable but annoying) errors thrown from Configuration.

Dennis

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> I have created NUTCH-768. I am in the middle of testing a few thousand page crawl for the most recent released version of Hadoop 0.20.1. Everything passes unit tests fine and there are no interface breaks. Looks like it will be an easy upgrade so far :)
>
> Great, thanks!
AbstractFetchSchedule
There is a piece of code I don't understand:

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    // pages are never truly GONE - we have to check them from time to time.
    // pages with too long fetchInterval are adjusted so that they fit within
    // maximum fetchInterval (segment retention period).
    if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
      datum.setFetchInterval(maxInterval * 0.9f);
      datum.setFetchTime(curTime);
    }
    if (datum.getFetchTime() > curTime) {
      return false; // not time yet
    }
    return true;
  }

Why is the fetch time set to curTime here? And why is the fetch interval set to maxInterval * 0.9f without checking the current value of fetchInterval?

Regards,
Reinhard
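The clamping logic in question can be pulled out and exercised on its own. This is a minimal sketch, not the real AbstractFetchSchedule: `CrawlDatum` is replaced by two plain values, and the interval is returned through a one-element array so the caller can observe the clamp. `maxInterval` is in seconds and the fetch times are in milliseconds, matching the `* 1000` in the snippet above:

```java
public class FetchScheduleSketch {
    /**
     * Isolated version of the clamp in shouldFetch: if the next fetch is
     * scheduled further in the future than maxInterval allows, pull it back
     * to "now" and shrink the interval to 90% of the maximum, so the page
     * is refetched before its stored copy ages out of the segments.
     *
     * @param fetchTimeMs  scheduled next-fetch time (ms)
     * @param curTimeMs    current time (ms)
     * @param maxInterval  maximum fetch interval (seconds)
     * @param intervalOut  one-element array; set to the clamped interval
     *                     (seconds) when the clamp fires
     * @return true if the URL is due for fetching
     */
    public static boolean shouldFetch(long fetchTimeMs, long curTimeMs,
                                      int maxInterval, float[] intervalOut) {
        if (fetchTimeMs - curTimeMs > (long) maxInterval * 1000) {
            intervalOut[0] = maxInterval * 0.9f; // unconditionally overwritten
            fetchTimeMs = curTimeMs;             // due immediately
        }
        return fetchTimeMs <= curTimeMs;
    }
}
```

Note that, as Reinhard observes, the sketch (like the original) overwrites the interval without looking at its previous value: once the scheduled fetch time has drifted past the retention window, any old interval is by definition too long, so it is simply reset below the cap.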