Mostly an FYI post for those working with country-specific search engines:
Just to continue on this topic, the country code TLD I'm looking at
doesn't publish a list of its registered domains, so we're back to
crawling to find domains. To add to the complexity, lots of people here
register .coms as their main domain instead of the country-specific TLD.
So our intended solution is to hack the filter so that it only crawls
and follows sites that match the specific TLD *or* match ARIN's IP list
for addresses in the country. (Note: ARIN publishes a list of IP
assignments by country. Not perfect, but it sure beats hand review.)
We'll assume that if they're hosted here that they're likely a site
relevant to the country.
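A rough sketch of that accept rule, in Python rather than as an actual Nutch filter plugin. The TLD, the address blocks, and the function name accept() are all made up for illustration (the blocks are reserved documentation ranges, not real country assignments), and the caller is assumed to have already resolved the host to an IPv4 address:

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical stand-ins: substitute the real ccTLD and the real
# per-country blocks taken from the registry's published list.
COUNTRY_TLD = ".ca"
COUNTRY_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def accept(url, resolved_ip):
    """Return True if the URL should be crawled and followed."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    # Rule 1: the host is under the country's own TLD.
    if host.endswith(COUNTRY_TLD):
        return True
    # Rule 2: the host resolves to an address assigned to the country.
    addr = ipaddress.ip_address(resolved_ip)
    return any(addr in net for net in COUNTRY_NETWORKS)
```

With these stand-in values, accept("http://example.com/", "192.0.2.10") passes on the IP rule even though the domain is a .com.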
For any remaining sites we're going to offer a manual submission
service. The sites will be reviewed manually, then added to the
filter. (We've got a preliminary PHP program that does this running now.)
ARIN's IP mapping to country isn't quite perfect. For example, our
servers are located here yet show up as being in the range of another
country. I expect I'll occasionally review the list of sites we've
added manually and look for trends in their IP addresses to see if we
have any missing ranges. At that point I can pull those domains from
the filter and just add the ranges to the IP address list.
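One possible way to spot those trends, sketched in Python: bucket the resolved addresses of the manually added sites by a fixed prefix and list the buckets that show up more than once. The function name, the /16 bucket size, and the threshold are arbitrary choices for illustration, and the sketch assumes IPv4 addresses:

```python
import ipaddress
from collections import Counter

def common_prefixes(ips, prefix_len=16, min_count=2):
    """Count how many addresses fall in each /prefix_len block and
    return the blocks seen at least min_count times, busiest first."""
    counts = Counter(
        # strict=False masks off the host bits, giving the enclosing block.
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in ips
    )
    return [(str(net), n) for net, n in counts.most_common() if n >= min_count]
```

A block that keeps turning up is only a candidate for the IP list. It would still need a manual look, since a single foreign hosting company could account for all of the hits.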
I'm concerned that adding a huge range of IPs to check will cause the
crawler to slow down. However, of the four bytes in an IP address, there
are only about 10 possibilities in the first byte (i.e. the
000 in 000.XXX.XXX.XXX). So we'll check just the first byte, then
continue to drill down only if there's a match.
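A minimal Python sketch of the first-byte idea, assuming the country ranges are available as IPv4 CIDR blocks (the helper names build_index() and in_country() are made up here). The ranges are grouped by first byte once, up front, so most lookups end after a single dictionary miss:

```python
import ipaddress
from collections import defaultdict

def build_index(cidrs):
    """Group IPv4 networks by the first byte(s) they cover."""
    index = defaultdict(list)
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        first = int(net.network_address) >> 24
        last = int(net.broadcast_address) >> 24
        # A block shorter than /8 spans several first-byte values.
        for octet in range(first, last + 1):
            index[octet].append(net)
    return index

def in_country(ip, index):
    """Check the first byte, then drill down only on a match."""
    addr = ipaddress.ip_address(ip)
    candidates = index.get(int(addr) >> 24)
    if not candidates:
        return False  # first byte not used by any country range
    return any(addr in net for net in candidates)
```

If the per-byte lists get long, the same trick could be repeated on the second byte, or the lists could be sorted and binary searched.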
HTH.
Matt Kangas wrote:
glenn, i know that verisign makes this available for .com and .net as
"TLD zone files".
for ccTLDs like .us and .uk, you'll have to see if the TLD registry
provides the same. the following page has some useful links to these
folks:
http://www.dnsstuff.com/info/dnslinks.htm
--matt
On Nov 29, 2005, at 10:23 AM, Insurance Squared Inc. wrote:
Along these same lines (as I'm interested in a similar
country-specific project), is there any place to get a list of all the
domains for a specific TLD to use to seed nutch? e.g. if I wanted
to get a list of all currently registered .it, .de, or .ca domains?
I've looked without success. I'm thinking that this information
isn't available due to spamming issues, however in the paper you
referenced they discuss crawling an entire TLD which seemed to
indicate they may have access to this info.
Thanks,
Glenn
Ken Krugler wrote:
Is there anyone that can implement a country crawler? I
estimate around 40m documents. Please send me info about your
prev work and how much time it would take to setup and money :-)
Check out the paper titled "Crawling a Country: Better Strategies
than Breadth-First for Web Page Ordering" by Ricardo Baeza-Yates &
others. They were using a crawl of Chilean domains to test
strategies for efficient crawling, so it seems like it would be of
interest to you.
The main problem we've run into in doing similar limited domain
crawls is that you wind up with many fewer hosts, and thus more
URLs/host in any given fetch loop. The restriction of being polite
(one thread per host) leads to lots of retry errors caused by
fetcher threads blocking on a host (IP address) that is already
being accessed by another fetcher thread, and thus lower pages/
second throughput.
So we've been making some mods to Nutch to improve our performance,
but it's not debugged yet...getting closer, though.
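To illustrate the contention described above (this is not the Nutch modification mentioned, just a generic sketch with a made-up function name): one simple mitigation is to order the fetch list so that URLs from the same host are spread out round-robin instead of sitting next to each other.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

def interleave_by_host(urls):
    """Reorder URLs round-robin across hosts so that consecutive
    entries hit the same host as rarely as possible."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)
    ordered = []
    pending = deque(queues.values())
    while pending:
        queue = pending.popleft()
        ordered.append(queue.popleft())
        if queue:  # host still has URLs left, send it to the back
            pending.append(queue)
    return ordered
```

This only helps while there are more hosts than fetcher threads. Once a few large hosts hold most of the remaining URLs, the tail of the list is still same-host URLs back to back, which is the situation a limited domain crawl tends to end up in.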
-- Ken
--
Matt Kangas / [EMAIL PROTECTED]
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general