Is there anyone who can implement a country crawler? I estimate around 40 million documents. Please send me information about your previous work, plus an estimate of the time and cost to set it up :-)

Check out the paper titled "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering" by Ricardo Baeza-Yates & others. They were using a crawl of Chilean domains to test strategies for efficient crawling, so it seems like it would be of interest to you.

The main problem we've run into with similar limited-domain crawls is that you wind up with far fewer hosts, and thus more URLs per host in any given fetch loop. The politeness restriction (one thread per host) leads to lots of retry errors, because fetcher threads block on a host (IP address) that is already being accessed by another fetcher thread. The result is lower pages/second throughput.
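To make the contention concrete, here is a minimal Python sketch of one way to avoid it: group URLs into per-host queues and only hand a fetcher a URL from a host that is currently idle, so no thread blocks on a busy host. This is an illustration only, not Nutch's fetcher and not the modifications we're working on. The HostQueues class and its method names are hypothetical, and it keys on hostname rather than resolved IP address to stay self-contained.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class HostQueues:
    """Hypothetical per-host URL queues: at most one URL in flight per host."""

    def __init__(self, urls):
        # One FIFO queue per hostname (assumption: hostname stands in for IP).
        self.queues = defaultdict(deque)
        for url in urls:
            self.queues[urlsplit(url).hostname].append(url)
        self.busy = set()  # hosts that currently have a fetch in flight

    def next_url(self):
        """Return a URL from any idle host, or None if no idle host has work."""
        for host, queue in self.queues.items():
            if queue and host not in self.busy:
                self.busy.add(host)
                return queue.popleft()
        return None

    def done(self, url):
        """Mark the URL's host as idle again once its fetch has finished."""
        self.busy.discard(urlsplit(url).hostname)


if __name__ == "__main__":
    q = HostQueues(["http://a.cl/1", "http://a.cl/2", "http://b.cl/1"])
    first = q.next_url()   # taken from a.cl
    second = q.next_url()  # a.cl is busy, so this comes from b.cl
    third = q.next_url()   # both hosts busy: nothing to hand out
    print(first, second, third)
```

With few hosts and many URLs per host, next_url() returns None often, which is the same throughput ceiling described above; the difference is that the idle fetcher can wait or pick up other work instead of logging a retry error.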

So we've been making some modifications to Nutch to improve our performance, but they're not debugged yet. Getting closer, though.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

