Nutch Developers,
Recently watching attempted fetches of many ingrida.be
URLs made me question the Nutch 0.8 algorithm for partitioning URLs
among TaskTrackers (and their child processes). As I understand
it, Nutch doesn't worry about two lexically distinct domains (e.g.,
inherit-the-wind.ingrida.be and clancy-brown.ingrida.be) being
fetched simultaneously, even though they might actually resolve to
the same IP address (66.154.11.25 in this case).
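For concreteness, host-based partitioning amounts to something like the
following (my own simplified sketch, not the actual Nutch code; the class
and method names are invented):

```java
// Roughly what hostname-based partitioning does: every URL whose hostname
// hashes to the same value goes to the same fetcher partition. Two
// subdomains of ingrida.be hash to different values, so nothing stops
// them from landing in different partitions (and thus being fetched
// simultaneously) even though they resolve to one IP address.
public class HostPartition {
    public static int partitionFor(String host, int numTasks) {
        // Mask the sign bit so the modulus is always non-negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
    }
}
```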
I wonder whether the current algorithm might be putting us in a
position where we're trying to simultaneously fetch a large number of
URLs from lexically distinct domains that actually resolve to a
handful of servers. In this case, the servers might get upset with
us, block or throttle our access, or otherwise slow the fetching
process and/or prevent it from being very successful.
Before we switched to using Nutch 0.8, we implemented our own URL
partitioning algorithm in the Nutch 0.7 source. Our approach was as
follows:
a) Find the (topN * 10) highest scoring URLs.
b) Resolve their domains to IP addresses. Here we had to manage our
own DNS cache, since the one in the JVM appeared to be limited in
size, and we couldn't control this limit. We kept the cache in memory
and didn't bother writing it out to disk at the end of a crawling
session, so the first segment in a crawling session suffered from
lots of time spent resolving domains. Since we implemented this,
several people have suggested various other DNS caching alternatives
to the one that's built into the JVM (e.g.,
http://cr.yp.to/djbdns.html and http://www.dnsjava.org/).
c) Limit the number of URLs per IP, which is why we start with (topN
* 10) when we only need topN.
d) Sort the resulting URLs by IP address, putting the IPs with the most
URLs first.
e) Partition the list to the fetcher tasks in blocks of URLs sharing
the same IP address. Thus, the fetchers end up fetching the biggest
domains first, and whenever a fetcher finishes a domain, it just
grabs the next largest unprocessed domain from the list.
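The DNS cache in step (b) can be as simple as an access-ordered map with
LRU eviction. The sketch below is not code from our Nutch 0.7 patch; the
class name, the injected resolver function, and the eviction policy are
all illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// A bounded, in-memory DNS cache along the lines of step (b).
public class DnsCache {
    private final Map<String, String> cache;
    private final Function<String, String> resolver;

    public DnsCache(int maxEntries, Function<String, String> resolver) {
        this.resolver = resolver;
        // An access-ordered LinkedHashMap plus removeEldestEntry gives a
        // simple LRU cache whose size WE control, unlike the resolver
        // cache built into the JVM.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Return the IP for a host, resolving and caching on a miss.
    public synchronized String lookup(String host) {
        return cache.computeIfAbsent(host, resolver);
    }
}
```

Injecting the resolver keeps the cache testable without touching the
network; in production it would wrap a real DNS lookup.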
This strategy keeps two fetchers from hitting the same IP address
simultaneously, and front-loads work on the IP addresses that will
probably take the longest to finish.
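Steps (c) through (e) boil down to something like the following sketch
(again with invented names, and assuming the input URLs arrive already
sorted by score and already resolved to IPs):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of steps (c)-(e): cap URLs per IP, then emit the URLs grouped
// by IP with the largest groups first, so each contiguous block of
// same-IP URLs can be handed to a single fetcher task.
public class IpPartitioner {
    // urlToIp: candidate URLs in score order, mapped to their resolved IPs.
    public static List<String> orderByIpBlocks(LinkedHashMap<String, String> urlToIp,
                                               int maxPerIp) {
        // Group URLs by IP, preserving score order within each group and
        // dropping URLs beyond the per-IP cap -- step (c).
        Map<String, List<String>> byIp = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : urlToIp.entrySet()) {
            List<String> group = byIp.computeIfAbsent(e.getValue(), k -> new ArrayList<>());
            if (group.size() < maxPerIp) group.add(e.getKey());
        }
        // Biggest IP groups first -- step (d) -- then flatten so URLs
        // sharing an IP stay contiguous, ready to be handed out in
        // blocks -- step (e).
        return byIp.values().stream()
                .sorted(Comparator.comparingInt((List<String> g) -> g.size()).reversed())
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}
```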
Has anyone else thought about or implemented a partitioning scheme
based on IP addresses instead of lexical domain names?
Thanks,
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers