Nutch Developers,
Recently watching attempted fetches of many ingrida.be
URLs made me question the Nutch 0.8 algorithm for partitioning URLs
among TaskTrackers (and their child processes). As I understand
it, Nutch doesn't worry about two lexically distinct domains (e.g.,
inherit-the-wind.ingrida.be and clancy-brown.ingrida.be) being
fetched simultaneously, even though they might actually resolve to
the same IP address (66.154.11.25 in this case).
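For concreteness, host-based partitioning amounts to something like the
following (my own simplified sketch, not the actual Nutch code; the class
and method names are invented):

```java
// Roughly what hostname-based partitioning does: every URL whose hostname
// hashes to the same value goes to the same fetcher partition. Two
// subdomains of ingrida.be hash to different values, so nothing stops
// them from landing in different partitions (and thus being fetched
// simultaneously) even though they resolve to one IP address.
public class HostPartition {
    public static int partitionFor(String host, int numTasks) {
        // Mask the sign bit so the modulus is always non-negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
    }
}
```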
I wonder whether the current algorithm might be putting us in a
position where we're trying to simultaneously fetch a large number of
URLs from lexically distinct domains that actually resolve to a
handful of servers. In this case, the servers might get upset with
us, block or throttle our access, or otherwise slow the fetching
process and/or prevent it from being very successful.
Before we switched to using Nutch 0.8, we implemented our own URL
partitioning algorithm in the Nutch 0.7 source. Our approach was as
follows:
a) Find the (topN * 10) highest scoring URLs.
b) Resolve their domains to IP addresses. Here we had to manage our
own DNS cache, since the one in the JVM appeared to be limited in
size, and we couldn't control this limit. We kept the cache in memory
and didn't bother writing it out to disk at the end of a crawling
session, so the first segment in a crawling session suffered from
lots of time spent resolving domains. Since we implemented this,
several people have suggested various other DNS caching alternatives
to the one that's built into the JVM (e.g.,
http://cr.yp.to/djbdns.html and http://www.dnsjava.org/).
c) Limit the number of URLs per IP, which is why we start with (topN
* 10) when we only need topN.
d) Sort the resulting URLs by IP address, putting the IPs with the most
URLs first.
e) Partition the list to the fetcher tasks in blocks of URLs sharing
the same IP address. Thus, the fetchers end up fetching the biggest
domains first, and whenever a fetcher finishes a domain, it just
grabs the next largest unprocessed domain from the list.
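The DNS cache in step (b) can be as simple as an access-ordered map with
LRU eviction. The sketch below is not code from our Nutch 0.7 patch; the
class name, the injected resolver function, and the eviction policy are
all illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// A bounded, in-memory DNS cache along the lines of step (b).
public class DnsCache {
    private final Map<String, String> cache;
    private final Function<String, String> resolver;

    public DnsCache(int maxEntries, Function<String, String> resolver) {
        this.resolver = resolver;
        // An access-ordered LinkedHashMap plus removeEldestEntry gives a
        // simple LRU cache whose size WE control, unlike the resolver
        // cache built into the JVM.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Return the IP for a host, resolving and caching on a miss.
    public synchronized String lookup(String host) {
        return cache.computeIfAbsent(host, resolver);
    }
}
```

Injecting the resolver keeps the cache testable without touching the
network; in production it would wrap a real DNS lookup.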
This strategy keeps two fetchers from hitting the same IP address
simultaneously, and front-loads work on the IP addresses that will
probably take the longest to finish.
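Steps (c) through (e) boil down to something like the following sketch
(again with invented names, and assuming the input URLs arrive already
sorted by score and already resolved to IPs):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of steps (c)-(e): cap URLs per IP, then emit the URLs grouped
// by IP with the largest groups first, so each contiguous block of
// same-IP URLs can be handed to a single fetcher task.
public class IpPartitioner {
    // urlToIp: candidate URLs in score order, mapped to their resolved IPs.
    public static List<String> orderByIpBlocks(LinkedHashMap<String, String> urlToIp,
                                               int maxPerIp) {
        // Group URLs by IP, preserving score order within each group and
        // dropping URLs beyond the per-IP cap -- step (c).
        Map<String, List<String>> byIp = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : urlToIp.entrySet()) {
            List<String> group = byIp.computeIfAbsent(e.getValue(), k -> new ArrayList<>());
            if (group.size() < maxPerIp) group.add(e.getKey());
        }
        // Biggest IP groups first -- step (d) -- then flatten so URLs
        // sharing an IP stay contiguous, ready to be handed out in
        // blocks -- step (e).
        return byIp.values().stream()
                .sorted(Comparator.comparingInt((List<String> g) -> g.size()).reversed())
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }
}
```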
Has anyone else thought about or implemented a partitioning scheme
based on IP addresses instead of lexical domain names?
Thanks,
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers