Chris Schneider wrote:
My recent experience of seeing attempted fetches of many ingrida.be URLs made me question the Nutch 0.8 algorithm for partitioning URLs among TaskTrackers (and their child processes). As I understand it, Nutch doesn't worry about two lexically distinct domains (e.g., inherit-the-wind.ingrida.be and clancy-brown.ingrida.be) being fetched simultaneously, even though they might actually resolve to the same IP address (66.154.11.25 in this case).

That is correct: Nutch 0.8 currently treats each lexically distinct domain as a separate domain. IP-based partitioning is possible: one would merely need to change PartitionUrlByHost.java to hash the IP address of the host rather than its name. If the DNS lookups make this too slow, we could cache the IP address in the CrawlDatum, which is available when we are performing this partitioning. But one should probably run a caching DNS server when fetching anyway, so hopefully that would not be required.
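
To make the idea concrete outside of Nutch, here is a minimal standalone sketch of hashing by resolved IP address with a fallback to the host name. The class name IpPartitionSketch, the in-memory CACHE map, and the final "mask and modulo" step are assumptions for illustration only; they are not Nutch APIs, and the in-memory map is merely a stand-in for the CrawlDatum caching idea mentioned above.

```java
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IpPartitionSketch {

  // Hypothetical per-process cache of host -> hash, so each host is
  // resolved at most once. Not part of Nutch.
  private static final Map<String, Integer> CACHE = new ConcurrentHashMap<>();

  // Hash a host by its resolved IP address; fall back to the host name
  // itself when the lookup fails.
  static int hostHash(String host) {
    return CACHE.computeIfAbsent(host, h -> {
      try {
        return InetAddress.getByName(h).hashCode();
      } catch (UnknownHostException e) {
        return h.hashCode();
      }
    });
  }

  // Map a URL to a partition in [0, numPartitions). The seed varies the
  // assignment between runs. The mask-and-modulo step is an assumption
  // about how a hash is usually turned into a partition number.
  static int partition(String urlString, int seed, int numPartitions) {
    int hashCode;
    try {
      hashCode = hostHash(new URL(urlString).getHost());
    } catch (MalformedURLException e) {
      // Unparseable URL: hash the raw string instead.
      hashCode = urlString.hashCode();
    }
    hashCode ^= seed;
    return (hashCode & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    // Two URLs on the same address always land in the same partition.
    System.out.println(partition("http://66.154.11.25/a", 7, 4));
    System.out.println(partition("http://66.154.11.25/b", 7, 4));
  }
}
```

Because InetAddress compares and hashes by address rather than by name, two host names that resolve to the same address get the same hash here, which is the property the partitioner needs.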

I've attached a patch. Tell me if it works and if it noticeably slows fetching for you.

Doug
Index: src/java/org/apache/nutch/crawl/PartitionUrlByHost.java
===================================================================
--- src/java/org/apache/nutch/crawl/PartitionUrlByHost.java	(revision 379848)
+++ src/java/org/apache/nutch/crawl/PartitionUrlByHost.java	(working copy)
@@ -17,6 +17,8 @@
 package org.apache.nutch.crawl;
 
 import java.net.URL;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
 import java.net.MalformedURLException;
 
 import org.apache.hadoop.io.*;
@@ -41,8 +43,22 @@
       url = new URL(urlString);
     } catch (MalformedURLException e) {
     }
-    int hashCode = (url==null ? urlString : url.getHost()).hashCode();
 
+    int hashCode;
+
+    if (url == null) {
+      hashCode = urlString.hashCode();
+    } else {
+      String host = url.getHost();
+      try {
+        InetAddress addr = InetAddress.getByName(host);
+        hashCode = addr.hashCode();
+      } catch (UnknownHostException e) {
+        Generator.LOG.info("Couldn't find IP for host: " + host);
+        hashCode = host.hashCode();
+      }
+    }
+
     // make hosts wind up in different partitions on different runs
     hashCode ^= seed;
 
@@ -50,5 +66,3 @@
   }
   
 }
-
-
