Re: Setting up a crawler for a country.

Ken Krugler Tue, 29 Nov 2005 10:27:04 -0800

Is there anyone that can implement a country crawler? I estimate around 40m documents. Please send me info about your prev work and how much time it would take to setup and money :-)

Check out the paper titled "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering" by Ricardo Baeza-Yates & others. They were using a crawl of Chilean domains to test strategies for efficient crawling, so it seems like it would be of interest to you.

The main problem we've run into in doing similar limited domain crawls is that you wind up with many fewer hosts, and thus more URLs/host in any given fetch loop. The restriction of being polite (one thread per host) leads to lots of retry errors caused by fetcher threads blocking on a host (IP address) that is already being accessed by another fetcher thread, and thus lower pages/second throughput.
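
To make the effect concrete, here's a rough toy model in Python. It is not Nutch code and not our mods, just a sketch under simplifying assumptions: every fetch takes one "tick", politeness is keyed on hostname rather than IP address, and a thread that hits a busy host simply puts the URL back on the queue and counts a retry. The function name simulate() and the example .cl hosts are made up for illustration.

```python
from collections import deque
from urllib.parse import urlsplit


def simulate(urls, n_threads):
    """Toy model of a polite fetch loop.

    Each tick, every thread takes one URL off the queue. At most one
    fetch per host may run in a tick; a URL whose host is already busy
    goes back on the end of the queue and is counted as a retry.
    """
    queue = deque(urls)
    ticks = 0
    retries = 0
    while queue:
        ticks += 1
        busy_hosts = set()
        # One attempt per thread per tick (fewer if the queue is short).
        for _ in range(min(n_threads, len(queue))):
            url = queue.popleft()
            host = urlsplit(url).hostname
            if host in busy_hosts:
                queue.append(url)  # another thread has this host: retry later
                retries += 1
            else:
                busy_hosts.add(host)  # fetched during this tick
    return {
        "ticks": ticks,
        "retries": retries,
        "pages_per_tick": len(urls) / ticks,
    }


if __name__ == "__main__":
    # Same number of URLs and threads, different host spread.
    broad = [f"http://host{i}.example.cl/" for i in range(8)]
    narrow = [f"http://host{i % 2}.example.cl/page{i}" for i in range(8)]
    print("broad :", simulate(broad, n_threads=4))
    print("narrow:", simulate(narrow, n_threads=4))
```

With the same URL count and thread count, the fetch list spread over few hosts should need more ticks and rack up retries, while the one spread over many hosts has none. In this model throughput can never exceed one page per host per tick, no matter how many threads you add.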

So we've been making some mods to Nutch to improve our performance, but it's not debugged yet...getting closer, though.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
