Nutch Developers,

Recently, watching attempted fetches of many ingrida.be URLs made me question the Nutch 0.8 algorithm for partitioning URLs among TaskTrackers (and their child processes). As I understand it, Nutch doesn't worry about two lexically distinct domains (e.g., inherit-the-wind.ingrida.be and clancy-brown.ingrida.be) being fetched simultaneously, even though they might actually resolve to the same IP address (66.154.11.25 in this case).

I wonder whether the current algorithm might be putting us in a position where we're trying to simultaneously fetch a large number of URLs from lexically distinct domains that actually resolve to a handful of servers. In this case, the servers might get upset with us, block or throttle our access, or otherwise slow the fetching process and/or prevent it from being very successful.

Before we switched to using Nutch 0.8, we implemented our own URL partitioning algorithm in the Nutch 0.7 source. Our approach was as follows:

a) Find the (topN * 10) highest-scoring URLs.

b) Resolve their domains to IP addresses. Here we had to manage our own DNS cache, since the one in the JVM appeared to be limited in size, and we couldn't control this limit. We kept the cache in memory and didn't bother writing it out to disk at the end of a crawling session, so the first segment in a crawling session suffered from lots of time spent resolving domains. Since we implemented this, several people have suggested various other DNS caching alternatives to the one that's built into the JVM (e.g., http://cr.yp.to/djbdns.html and http://www.dnsjava.org/).

c) Limit the number of URLs per IP, which is why we start with (topN * 10) when we only need topN.

d) Sort the resulting URLs by IP address with the biggest domains first.

e) Partition the list among the fetcher tasks in blocks of URLs sharing the same IP address. Thus, the fetchers end up fetching the biggest domains first, and whenever a fetcher finishes a domain, it just grabs the next largest unprocessed domain from the list.
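For step (b), the size-bounded cache we maintained could be sketched along these lines, using a `LinkedHashMap` in access-order mode for LRU eviction. This is illustrative only, not the code we actually ran; class and method names are made up for the example:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal in-memory DNS cache with a size bound we control ourselves,
// unlike the cache built into the JVM. Oldest (least recently used)
// entries are evicted once maxEntries is exceeded.
public class DnsCache {
    private final Map<String, String> cache;

    public DnsCache(final int maxEntries) {
        // access-order LinkedHashMap gives LRU behavior for free
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Resolve a hostname to an IP string, consulting the cache first. */
    public synchronized String resolve(String host) {
        String ip = cache.get(host);
        if (ip == null) {
            try {
                ip = InetAddress.getByName(host).getHostAddress();
            } catch (UnknownHostException e) {
                return null;    // caller skips unresolvable hosts
            }
            cache.put(host, ip);
        }
        return ip;
    }
}
```

As noted above, the cache was memory-only, so the first segment of each crawling session paid the full resolution cost.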

This strategy avoids having two fetchers hitting the same IP address, and front-loads work on the IP addresses that are probably going to take the longest time to finish.
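Steps (b) through (e) could be sketched roughly as follows. This is a simplified, hypothetical reconstruction, not our actual Nutch 0.7 patch: the host-to-IP map stands in for the DNS cache above, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of IP-based partitioning: group candidate URLs by resolved IP,
// cap the number of URLs per IP, then order the groups largest-first so
// the slowest (biggest) hosts are started earliest.
public class IpPartitioner {

    /** Returns URLs grouped by IP, largest group first, at most maxPerIp per IP. */
    public static List<String> partitionByIp(List<String> urls,
                                             Map<String, String> hostToIp,
                                             int maxPerIp) {
        // (b) resolve each URL's host to an IP, grouping as we go
        Map<String, List<String>> byIp = new HashMap<>();
        for (String url : urls) {
            String host = url.replaceFirst("^https?://", "").split("/")[0];
            String ip = hostToIp.get(host);
            if (ip == null) continue;               // unresolvable: skip
            List<String> group = byIp.computeIfAbsent(ip, k -> new ArrayList<>());
            if (group.size() < maxPerIp) {          // (c) cap URLs per IP
                group.add(url);
            }
        }
        // (d) biggest IP groups first
        List<List<String>> groups = new ArrayList<>(byIp.values());
        groups.sort((a, b) -> b.size() - a.size());
        // (e) emit contiguous blocks sharing one IP; a fetcher task
        // takes a whole block, so no two fetchers hit the same IP
        List<String> ordered = new ArrayList<>();
        for (List<String> g : groups) {
            ordered.addAll(g);
        }
        return ordered;
    }
}
```

Because each block of same-IP URLs goes to a single fetcher, politeness is enforced per server rather than per lexical domain name.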

Has anyone else thought about or implemented a partitioning scheme based on IP addresses instead of lexical domain names?

Thanks,

- Chris

--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers