Along these same lines (I'm interested in a similar country-specific
project), is there anywhere to get a list of all the domains under a
specific TLD to use to seed Nutch? E.g., if I wanted a list of
all currently registered .it, .de, or .ca domains?
I've looked without success. I suspect this information isn't
published because of spam concerns; however, the paper you referenced
discusses crawling an entire TLD, which suggests the authors may
have had access to it.
Thanks,
Glenn
Ken Krugler wrote:
Is there anyone who can implement a country crawler? I estimate
around 40M documents. Please send me info about your previous work,
how long it would take to set up, and what it would cost :-)
Check out the paper titled "Crawling a Country: Better Strategies than
Breadth-First for Web Page Ordering" by Ricardo Baeza-Yates & others.
They were using a crawl of Chilean domains to test strategies for
efficient crawling, so it seems like it would be of interest to you.
The main problem we've run into in doing similar limited domain crawls
is that you wind up with many fewer hosts, and thus more URLs/host in
any given fetch loop. The restriction of being polite (one thread per
host) leads to lots of retry errors caused by fetcher threads blocking
on a host (IP address) that is already being accessed by another
fetcher thread, and thus lower pages/second throughput.
So we've been making some mods to Nutch to improve our performance,
but it's not debugged yet...getting closer, though.
-- Ken
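To make the politeness bottleneck Ken describes concrete, here is a rough Python sketch. It is not Nutch code and not the modification Ken mentions; the function names and the 5-second delay are assumptions chosen purely for illustration. It groups a fetch list by host, interleaves the hosts round-robin so each host's URLs are spread out, and estimates a throughput ceiling when only one thread may talk to a host at a time.

```python
from collections import Counter, defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL (empty string if absent)."""
    return (urlsplit(url).hostname or "").lower()


def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts so that each host's URLs
    are spread out instead of clustered together."""
    queues = defaultdict(list)
    for url in urls:
        queues[host_of(url)].append(url)
    ordered = []
    # Take one URL from every host per pass; exhausted hosts yield None.
    for batch in zip_longest(*queues.values()):
        ordered.extend(u for u in batch if u is not None)
    return ordered


def max_pages_per_second(urls, delay_seconds):
    """Rough ceiling on throughput, assuming (hypothetically) that each
    request keeps its host busy for delay_seconds and that only one
    thread may use a host at a time."""
    if not urls:
        return 0.0
    per_host = Counter(host_of(u) for u in urls)
    busiest = max(per_host.values())
    # The busiest host alone needs busiest * delay_seconds to drain.
    return len(urls) / (busiest * delay_seconds)
```

Under these assumptions, the ceiling depends on the busiest host rather than on the number of fetcher threads: a list of 1,000 URLs on a single host with a 5-second delay tops out at 0.2 pages/second, while the same 1,000 URLs spread evenly over 100 hosts could reach 20 pages/second. That is the effect of having fewer hosts and more URLs per host in a country-limited crawl.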