Here is a lightly tested patch for the crawl delay that allows urls with a crawl delay greater than x seconds to be ignored. I have run this on over 2 million urls and it is working well. With this patch I also added the following to my nutch-site.xml file to ignore sites with a crawl delay > 30 seconds; the value can be changed to suit. I have seen crawl delays as high as 259200 seconds in our crawls.

<property>
<name>http.max.crawl.delay</name>
<value>30</value>
<description>
If the crawl delay in robots.txt is set to greater than this value then the
fetcher will ignore this page.  If set to -1 the fetcher will never ignore
pages and will wait the amount of time retrieved from robots.txt crawl delay.
This can cause hung threads if the delay is >= the task timeout value.  If all
threads hang it can cause the fetcher task to abort prematurely.
</description>
</property>
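The property above drives a check like the following. This is a standalone sketch with illustrative names, not the patch itself: the robots.txt delay (in milliseconds) is compared against the configured cap, with -1 meaning "never ignore", matching the http.max.crawl.delay default.

```java
/** Minimal sketch of the max-crawl-delay check; names are illustrative. */
public class MaxCrawlDelayCheck {

    /** Decide whether to skip a url whose robots.txt crawl delay is delayMillis.
     *  A maxCrawlDelaySeconds of -1 means the fetcher never ignores pages. */
    static boolean shouldIgnore(long delayMillis, int maxCrawlDelaySeconds) {
        if (maxCrawlDelaySeconds < 0) {
            return false; // unlimited: always honor the robots.txt delay
        }
        int delaySeconds = (int) (delayMillis / 1000);
        return delaySeconds > maxCrawlDelaySeconds;
    }

    public static void main(String[] args) {
        // With the example config value of 30 seconds:
        System.out.println(shouldIgnore(259200L * 1000, 30)); // extreme delay: skip
        System.out.println(shouldIgnore(5000, 30));           // 5 seconds: fetch
        System.out.println(shouldIgnore(259200L * 1000, -1)); // -1: never skip
    }
}
```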

Thanks, this is useful. We did a survey of a bunch of sites, and found crawl delay values up to 99999 seconds.

The most recent patches to the fetcher (not Fetcher2) with NUTCH-334 seem to have sped up our fetching dramatically. We are only using about 50 fetchers but are consistently fetching 1M+ urls per day. The attached patch and the 334 patches will help if staying on 0.8. If moving forward I think the new Fetcher2 codebase is a better solution, though still a new one.

NUTCH-334 or NUTCH-344? I'm assuming the latter (http://issues.apache.org/jira/browse/NUTCH-344).

Thanks,

-- Ken


Ken Krugler wrote:
On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hello,

Several people reported issues with slow fetcher in 0.8...

I run Nutch on a dual CPU (+HT) box, and have noticed that the fetch speed didn't increase when I went from using 100 threads, to 200 threads. Has anyone else observed the same?

I was using 2 map tasks (mapred.map.tasks property) in both cases, and the aggregate fetch speed was between 20 and 40 pages/sec. This was a fetch of 50K+ URLs from a diverse set of servers.

While crawling, strace -p<PID> and strace -ff<PID> shows a LOT of gettimeofday calls. Running strace several times in a row kept showing that gettimeofday is the most frequent system call. Has anyone tried tracing the fetcher process? Where do these calls come from? Any call to new Date() or Calendar.getInstance(), as must be done for every single logging call, perhaps?

I can certainly be impolite and lower fetcher.server.delay to 1 second or even 0, but I'd like to be polite.

I saw Ken Krugler's email suggesting to increase the number of fetcher threads to 2000+ and set the maximum java thread stack size to 512k with -Xss. Has anyone other than Ken tried this with success? Wouldn't the JVM go crazy context switching between this many threads?

Note that most of the time these fetcher threads are all blocked, waiting for other threads that are already fetching from the same IP address. So there's not a lot of thrashing.
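As an aside, the per-thread stack size can also be set per thread rather than globally with -Xss. A sketch (not from this thread) using Thread's four-argument constructor, whose stackSize argument is only a hint that some JVMs ignore:

```java
/** Sketch of giving individual threads a ~512 KB stack via Thread's
 *  four-argument constructor, instead of a global -Xss512k. */
public class SmallStackThreads {

    /** Start `count` no-op threads with the given stack-size hint and join them. */
    static int startAndJoin(int count, long stackSize) throws InterruptedException {
        Thread[] pool = new Thread[count];
        for (int i = 0; i < count; i++) {
            // A real fetcher thread would spend most of its time blocked,
            // waiting on a host that another thread is already fetching from.
            Runnable r = new Runnable() { public void run() { } };
            pool[i] = new Thread(null, r, "fetcher-" + i, stackSize);
            pool[i].start();
        }
        int joined = 0;
        for (Thread t : pool) { t.join(); joined++; }
        return joined;
    }

    public static void main(String[] args) throws InterruptedException {
        // 100 threads, scaled down from the 2000+ discussed above.
        System.out.println(startAndJoin(100, 512 * 1024) + " threads joined");
    }
}
```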

I've been working with a 512k -Xss (Ken Krugler's suggestion) and it works
well. However, the number of fetchers on my end is 2500+. I had to play
around with this number to match my bandwidth limitation, but now I
saturate my full bandwidth. The problems that I run into are the
fetcher threads hanging, and the crawl delay/robots.txt file (please see
Dennis Kubes' posting on this).

Yes, these are definitely problems.

Stefan has been working on a queue-based fetcher that uses NIO. Seems very promising, but not yet ready for prime time.

-- Ken


Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
===================================================================
--- src/java/org/apache/nutch/protocol/ProtocolStatus.java (revision 428457)
+++ src/java/org/apache/nutch/protocol/ProtocolStatus.java      (working copy)
@@ -60,6 +60,10 @@
   public static final int NOTFETCHING          = 20;
   /** Unchanged since the last fetch. */
   public static final int NOTMODIFIED          = 21;
+  /** Request was refused by protocol plugins, because it would block.
+   * The expected number of milliseconds to wait before retry may be provided
+   * in args. */
+  public static final int WOULDBLOCK          = 22;
 
   // Useful static instances for status codes that don't usually require any
   // additional arguments.
Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java (revision 430392)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java (working copy)
@@ -125,6 +125,9 @@

   /** Do we use HTTP/1.1? */
   protected boolean useHttp11 = false;
+
+  /** Ignore page if crawl delay over this number of seconds */
+  private int maxCrawlDelay = -1;

   /** Creates a new instance of HttpBase */
   public HttpBase() {
@@ -152,6 +155,7 @@
         this.userAgent = getAgentString(conf.get("http.agent.name"),
             conf.get("http.agent.version"), conf.get("http.agent.description"),
             conf.get("http.agent.url"), conf.get("http.agent.email"));
         this.serverDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
+        this.maxCrawlDelay = conf.getInt("http.max.crawl.delay", -1);
         // backward-compatible default setting
         this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
         this.useHttp11 = conf.getBoolean("http.http11", false);
@@ -185,6 +189,17 @@
       long crawlDelay = robots.getCrawlDelay(this, u);
       long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
+
+      int crawlDelaySeconds = (int) (delay / 1000);
+      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
+        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: max=" +
+          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
+          " seconds");
+        Content c = new Content(u.toString(), u.toString(), EMPTY_CONTENT,
+          null, null, this.conf);
+        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.WOULDBLOCK));
+      }
+
       String host = blockAddr(u, delay);
       Response response;
       try {
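
A fetcher that receives the new status could then drop the page instead of waiting. A hypothetical sketch of such a handler; this is illustrative and not actual Fetcher code:

```java
/** Hypothetical sketch of reacting to the new WOULDBLOCK status in a
 *  fetcher loop; not actual Nutch Fetcher code. */
public class WouldBlockHandling {

    static final int WOULDBLOCK = 22; // mirrors ProtocolStatus.WOULDBLOCK in the patch

    /** Decide what to do with a fetch attempt given its protocol status code. */
    static String handle(int statusCode) {
        if (statusCode == WOULDBLOCK) {
            // Crawl delay exceeded http.max.crawl.delay: drop the page rather
            // than tying a thread up until the task timeout kills it.
            return "skip";
        }
        return "process";
    }

    public static void main(String[] args) {
        System.out.println(handle(WOULDBLOCK)); // skip
        System.out.println(handle(0));          // process
    }
}
```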


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
