Ken Krugler wrote:
On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hello,
Several people reported issues with slow fetcher in 0.8...
I run Nutch on a dual-CPU (+HT) box, and have noticed that the
fetch speed didn't increase when I went from 100 threads to
200 threads. Has anyone else observed the same?
I was using 2 map tasks (mapred.map.tasks property) in both
cases, and the aggregate fetch speed was between 20 and 40
pages/sec. This was a fetch of 50K+ URLs from a diverse set of
servers.
While crawling, strace -p <PID> and strace -ff -p <PID> show a LOT of
gettimeofday calls. Running strace several times in a row kept
showing that gettimeofday is the most frequent system call.
Has anyone tried tracing the fetcher process? Where do these
calls come from? Perhaps from calls to new Date() or
Calendar.getInstance(), as must happen for every single logging
call?
I can certainly be impolite and lower fetcher.server.delay to 1
second or even 0, but I'd like to be polite.
I saw Ken Krugler's email suggesting increasing the number of
fetcher threads to 2000+ and setting the maximum Java thread stack
size to 512k with -Xss. Has anyone other than Ken tried this
with success? Wouldn't the JVM go crazy context switching
between this many threads?
Note that most of the time these fetcher threads are all blocked,
waiting for other threads that are already fetching from the same
IP address. So there's not a lot of thrashing.
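A minimal sketch (not Nutch code) of why parked threads are cheap: a thread blocked in Object.wait() consumes no CPU and is not scheduled until notified, so even thousands of threads waiting on per-IP locks cause little context switching. The hostLock and thread count here are illustrative:

```java
import java.util.concurrent.CountDownLatch;

public class ParkedThreadsDemo {
    /** Parks nThreads in Object.wait() on one lock, then wakes them all.
     *  Returns the number of threads that parked and resumed. */
    static int parkAndResume(int nThreads) throws InterruptedException {
        final Object hostLock = new Object();       // stands in for a per-IP lock
        final CountDownLatch parked = new CountDownLatch(nThreads);
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                synchronized (hostLock) {
                    parked.countDown();             // still holding the lock here
                    try {
                        hostLock.wait();            // parked: no CPU, not scheduled
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            workers[i].start();
        }
        parked.await();                             // all threads have counted down...
        synchronized (hostLock) {                   // ...and released the lock in wait()
            hostLock.notifyAll();                   // the "fetch slot" frees up
        }
        int resumed = 0;
        for (Thread t : workers) { t.join(); resumed++; }
        return resumed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parkAndResume(200) + " threads parked and resumed");
    }
}
```

The real cost of thousands of threads is stack memory, which is exactly what the -Xss tuning addresses.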
I have been working with a 512k -Xss (Ken Krugler's suggestion) and it
works well; however, in my case the number of fetcher threads is 2500+.
I had to play around with this number to match my bandwidth limit, but
now I saturate my full bandwidth. The problems I run into are fetcher
threads hanging, and the crawl delay/robots.txt handling (please see
Dennis Kubes' posting on this).
Yes, these are definitely problems.
Stefan has been working on a queue-based fetcher that uses NIO.
Seems very promising, but not yet ready for prime time.
-- Ken
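The queue-based idea mentioned above can be sketched roughly: instead of one blocked thread per in-flight host, URLs are grouped into per-host queues and a small pool drains them, serializing fetches to the same host. This is only a toy illustration of the concept, not Stefan's NIO code; all names here are hypothetical:

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

/** Toy per-host queue: URLs for the same host are handed out one at a
 *  time, so per-host politeness needs no blocked fetcher threads. */
public class HostQueues {
    private final Map<String, Queue<String>> byHost = new HashMap<>();

    public synchronized void add(String host, String url) {
        byHost.computeIfAbsent(host, h -> new LinkedList<>()).add(url);
    }

    /** Returns the next queued URL for the host, or null if none remain. */
    public synchronized String next(String host) {
        Queue<String> q = byHost.get(host);
        return (q == null) ? null : q.poll();
    }

    public static void main(String[] args) {
        HostQueues qs = new HostQueues();
        qs.add("example.com", "http://example.com/a");
        qs.add("example.com", "http://example.com/b");
        System.out.println(qs.next("example.com")); // http://example.com/a
        System.out.println(qs.next("example.com")); // http://example.com/b
        System.out.println(qs.next("example.com")); // null
    }
}
```

A scheduler that only offers a host's queue after its crawl delay has elapsed would replace the blockAddr-style waiting entirely.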
Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
===================================================================
--- src/java/org/apache/nutch/protocol/ProtocolStatus.java	(revision 428457)
+++ src/java/org/apache/nutch/protocol/ProtocolStatus.java	(working copy)
@@ -60,6 +60,10 @@
   public static final int NOTFETCHING = 20;
   /** Unchanged since the last fetch. */
   public static final int NOTMODIFIED = 21;
+  /** Request was refused by protocol plugins, because it would block.
+   * The expected number of milliseconds to wait before retry may be provided
+   * in args. */
+  public static final int WOULDBLOCK = 22;
   // Useful static instances for status codes that don't usually require any
   // additional arguments.
Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java	(revision 430392)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java	(working copy)
@@ -125,6 +125,9 @@
   /** Do we use HTTP/1.1? */
   protected boolean useHttp11 = false;
 
+  /** Ignore page if crawl delay over this number of seconds */
+  private int maxCrawlDelay = -1;
+
   /** Creates a new instance of HttpBase */
   public HttpBase() {
@@ -152,6 +155,7 @@
     this.userAgent = getAgentString(conf.get("http.agent.name"), conf.get("http.agent.version"), conf
         .get("http.agent.description"), conf.get("http.agent.url"), conf.get("http.agent.email"));
     this.serverDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
+    this.maxCrawlDelay = (int) conf.getInt("http.max.crawl.delay", -1);
     // backward-compatible default setting
     this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
     this.useHttp11 = conf.getBoolean("http.http11", false);
@@ -185,6 +189,17 @@
     long crawlDelay = robots.getCrawlDelay(this, u);
     long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
+
+    int crawlDelaySeconds = (int) (delay / 1000);
+    if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
+      LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: max=" +
+          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds + " seconds");
+      Content c = new Content(u.toString(), u.toString(), EMPTY_CONTENT,
+          null, null, this.conf);
+      return new ProtocolOutput(c, new
+          ProtocolStatus(ProtocolStatus.WOULDBLOCK));
+    }
+
     String host = blockAddr(u, delay);
     Response response;
     try {
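The core of the patch's check can be exercised in isolation. This is an illustrative extraction, not the real code (which lives inline in HttpBase): skip the page when a non-negative http.max.crawl.delay cap, in seconds, is exceeded by the effective delay, in milliseconds:

```java
public class MaxCrawlDelayCheck {
    /** Mirrors the patch's condition: a maxCrawlDelay of -1 (the default)
     *  disables the cap; otherwise pages whose crawl delay exceeds the
     *  cap are refused with WOULDBLOCK. */
    static boolean wouldBlock(long delayMillis, int maxCrawlDelaySeconds) {
        int crawlDelaySeconds = (int) (delayMillis / 1000);
        return crawlDelaySeconds > maxCrawlDelaySeconds && maxCrawlDelaySeconds >= 0;
    }

    public static void main(String[] args) {
        System.out.println(wouldBlock(5000, 2));   // true: 5s delay over a 2s cap
        System.out.println(wouldBlock(1000, 2));   // false: under the cap
        System.out.println(wouldBlock(60000, -1)); // false: cap disabled
    }
}
```

Note the check rounds down to whole seconds before comparing, so a delay of, say, 2999 ms does not trip a 2-second cap.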