Here is a lightly tested patch for the crawl delay that allows urls with a crawl delay greater than x seconds to be ignored. I have run this on over 2 million urls and it is working well. With this patch I also added the following to my nutch-site.xml file to ignore sites with a crawl delay > 30 seconds; the value can be changed to suit. I have seen crawl delays as high as 259200 seconds in our crawls.

<property>
<name>http.max.crawl.delay</name>
<value>30</value>
<description>
If the crawl delay in robots.txt is set to greater than this value then the
fetcher will ignore this page.  If set to -1 the fetcher will never ignore
pages and will wait the amount of time retrieved from robots.txt crawl delay.
This can cause hung threads if the delay is >= the task timeout value.  If all
threads hang it can cause the fetcher task to abort prematurely.
</description>
</property>
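The property above drives a check like the following. This is a standalone sketch with illustrative names, not the patch itself: the robots.txt delay (in milliseconds) is compared against the configured cap, with -1 meaning "never ignore", matching the http.max.crawl.delay default.

```java
/** Minimal sketch of the max-crawl-delay check; names are illustrative. */
public class MaxCrawlDelayCheck {

    /** Decide whether to skip a url whose robots.txt crawl delay is delayMillis.
     *  A maxCrawlDelaySeconds of -1 means the fetcher never ignores pages. */
    static boolean shouldIgnore(long delayMillis, int maxCrawlDelaySeconds) {
        if (maxCrawlDelaySeconds < 0) {
            return false; // unlimited: always honor the robots.txt delay
        }
        int delaySeconds = (int) (delayMillis / 1000);
        return delaySeconds > maxCrawlDelaySeconds;
    }

    public static void main(String[] args) {
        // With the example config value of 30 seconds:
        System.out.println(shouldIgnore(259200L * 1000, 30)); // extreme delay: skip
        System.out.println(shouldIgnore(5000, 30));           // 5 seconds: fetch
        System.out.println(shouldIgnore(259200L * 1000, -1)); // -1: never skip
    }
}
```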

Thanks, this is useful. We did a survey of a bunch of sites, and found crawl delay values up to 99999 seconds.

The most recent patches to the fetcher (not Fetcher2) with NUTCH-334 seem to have sped up our fetching dramatically. We are only using about 50 fetchers but are consistently fetching 1M+ urls per day. The attached patch and the 334 patches will help if staying on 0.8. If moving forward I think the new Fetcher2 codebase is a better solution, though still a new one.

NUTCH-334 or NUTCH-344? I'm assuming the latter (http://issues.apache.org/jira/browse/NUTCH-344).

Thanks,

-- Ken


Ken Krugler wrote:
On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hello,

Several people reported issues with slow fetcher in 0.8...

I run Nutch on a dual CPU (+HT) box, and have noticed that the fetch speed didn't increase when I went from using 100 threads, to 200 threads. Has anyone else observed the same?

I was using 2 map tasks (mapred.map.tasks property) in both cases, and the aggregate fetch speed was between 20 and 40 pages/sec. This was a fetch of 50K+ URLs from a diverse set of servers.

While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of gettimeofday calls. Running strace several times in a row kept showing that gettimeofday is the most frequent system call. Has anyone tried tracing the fetcher process? Where do these calls come from? Any call to new Date() or Calendar.getInstance(), as must be done for every single logging call, perhaps?

I can certainly be impolite and lower fetcher.server.delay to 1 second or even 0, but I'd like to be polite.

I saw Ken Krugler's email suggesting to increase the number of fetcher threads to 2000+ and set the maximum java thread stack size to 512k with -Xss. Has anyone other than Ken tried this with success? Wouldn't the JVM go crazy context switching between this many threads?

Note that most of the time these fetcher threads are all blocked, waiting for other threads that are already fetching from the same IP address. So there's not a lot of thrashing.
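As an aside, the per-thread stack size can also be set per thread rather than globally with -Xss. A sketch (not from this thread) using Thread's four-argument constructor, whose stackSize argument is only a hint that some JVMs ignore:

```java
/** Sketch of giving individual threads a ~512 KB stack via Thread's
 *  four-argument constructor, instead of a global -Xss512k. */
public class SmallStackThreads {

    /** Start `count` no-op threads with the given stack-size hint and join them. */
    static int startAndJoin(int count, long stackSize) throws InterruptedException {
        Thread[] pool = new Thread[count];
        for (int i = 0; i < count; i++) {
            // A real fetcher thread would spend most of its time blocked,
            // waiting on a host that another thread is already fetching from.
            Runnable r = new Runnable() { public void run() { } };
            pool[i] = new Thread(null, r, "fetcher-" + i, stackSize);
            pool[i].start();
        }
        int joined = 0;
        for (Thread t : pool) { t.join(); joined++; }
        return joined;
    }

    public static void main(String[] args) throws InterruptedException {
        // 100 threads, scaled down from the 2000+ discussed above.
        System.out.println(startAndJoin(100, 512 * 1024) + " threads joined");
    }
}
```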

I've been working with a 512k -Xss (Ken Krugler's suggestion) and it works
well. However, the number of fetchers on my end is 2500+. I had to play
around with this number to match my bandwidth limitation, but now I
saturate my full bandwidth. The problems that I run into are the
fetcher threads hanging, and the crawl delay/robots.txt file (please see
Dennis Kubes' posting on this).

Yes, these are definitely problems.

Stefan has been working on a queue-based fetcher that uses NIO. Seems very promising, but not yet ready for prime time.

-- Ken


Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
===================================================================
--- src/java/org/apache/nutch/protocol/ProtocolStatus.java (revision 428457)
+++ src/java/org/apache/nutch/protocol/ProtocolStatus.java      (working copy)
@@ -60,6 +60,10 @@
   public static final int NOTFETCHING          = 20;
   /** Unchanged since the last fetch. */
   public static final int NOTMODIFIED          = 21;
+  /** Request was refused by protocol plugins, because it would block.
+   * The expected number of milliseconds to wait before retry may be provided
+   * in args. */
+  public static final int WOULDBLOCK          = 22;
 
   // Useful static instances for status codes that don't usually require any
   // additional arguments.
Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java (revision 430392)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java (working copy)
@@ -125,6 +125,9 @@

   /** Do we use HTTP/1.1? */
   protected boolean useHttp11 = false;
+
+  /** Ignore page if crawl delay over this number of seconds */
+  private int maxCrawlDelay = -1;

   /** Creates a new instance of HttpBase */
   public HttpBase() {
@@ -152,6 +155,7 @@
         this.userAgent = getAgentString(conf.get("http.agent.name"),
             conf.get("http.agent.version"), conf.get("http.agent.description"),
             conf.get("http.agent.url"), conf.get("http.agent.email"));
         this.serverDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
+        this.maxCrawlDelay = conf.getInt("http.max.crawl.delay", -1);
         // backward-compatible default setting
         this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
         this.useHttp11 = conf.getBoolean("http.http11", false);
@@ -185,6 +189,17 @@
       long crawlDelay = robots.getCrawlDelay(this, u);
       long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
+
+      int crawlDelaySeconds = (int) (delay / 1000);
+      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
+        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: max=" +
+          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
+          " seconds");
+        Content c = new Content(u.toString(), u.toString(), EMPTY_CONTENT,
+          null, null, this.conf);
+        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.WOULDBLOCK));
+      }
+
       String host = blockAddr(u, delay);
       Response response;
       try {
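
A fetcher that receives the new status could then drop the page instead of waiting. A hypothetical sketch of such a handler; this is illustrative and not actual Fetcher code:

```java
/** Hypothetical sketch of reacting to the new WOULDBLOCK status in a
 *  fetcher loop; not actual Nutch Fetcher code. */
public class WouldBlockHandling {

    static final int WOULDBLOCK = 22; // mirrors ProtocolStatus.WOULDBLOCK in the patch

    /** Decide what to do with a fetch attempt given its protocol status code. */
    static String handle(int statusCode) {
        if (statusCode == WOULDBLOCK) {
            // Crawl delay exceeded http.max.crawl.delay: drop the page rather
            // than tying a thread up until the task timeout kills it.
            return "skip";
        }
        return "process";
    }

    public static void main(String[] args) {
        System.out.println(handle(WOULDBLOCK)); // skip
        System.out.println(handle(0));          // process
    }
}
```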


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
