Hi everybody, I want to understand the error rate when using 2000+ threads. With 2000 threads I get 80-90% errors; with 1000 threads we get around 20% errors. Is there a relation between the number of threads and the error rate? Can anyone share a configuration that gets lower error rates with 2000+ threads? We have a random set of URLs from a diverse set of hosts.
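For context, the properties below are the ones that usually come into play when pushing the fetcher to thread counts this high. The names are standard Nutch/Hadoop settings, but the values are only illustrative guesses, not a tested recipe for 2000+ threads:

<property>
  <name>fetcher.threads.fetch</name>
  <value>2000</value>
  <description>Number of fetcher threads used by each fetch task.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum concurrent requests to a single host; leaving this at 1 keeps the crawl polite.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>HTTP network timeout in milliseconds; at high concurrency a timeout that is too low shows up as fetch errors.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Xss512k</value>
  <description>JVM options for the child task; the small -Xss512k thread stack is what makes 2000+ threads fit in memory.</description>
</property>

If most of the errors are timeouts or hung threads rather than per-host failures, the crawl-delay patch discussed in the thread below may also help.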
Dennis Kubes wrote:
> Sorry, yeah it was 344 not 334.
>
> Dennis
>
> Ken Krugler wrote:
>
>>> Here is a lightly tested patch for the crawl delay that allows urls
>>> with a crawl delay greater than x number of seconds to be ignored.
>>> I have currently run this on over 2 million urls and it is working
>>> well. With this patch I also added the following to my nutch-site.xml
>>> file for ignoring sites with a crawl delay > 30 seconds. The value can
>>> be changed to suit. I have seen crawl delays as high as 259200 seconds
>>> in our crawls.
>>>
>>> <property>
>>>   <name>http.max.crawl.delay</name>
>>>   <value>30</value>
>>>   <description>
>>>   If the crawl delay in robots.txt is set to greater than this value then the
>>>   fetcher will ignore this page. If set to -1 the fetcher will never ignore
>>>   pages and will wait the amount of time retrieved from the robots.txt crawl
>>>   delay. This can cause hung threads if the delay is >= the task timeout value.
>>>   If all threads get hung it can cause the fetcher task to abort prematurely.
>>>   </description>
>>> </property>
>>
>> Thanks, this is useful. We did a survey of a bunch of sites, and
>> found crawl delay values up to 99999 seconds.
>>
>>> The most recent patches to fetcher (not fetcher2) with NUTCH-334
>>> seem to have sped up our fetching dramatically. We are only using
>>> about 50 fetchers but are consistently fetching 1M+ urls per day.
>>> The patch attached here and the 334 patches will help if staying on 0.8.
>>> If moving forward I think the new Fetcher2 codebase is a better
>>> solution, though still a new one.
>>
>> NUTCH-334 or NUTCH-344? I'm assuming the latter
>> (http://issues.apache.org/jira/browse/NUTCH-344).
>>
>> Thanks,
>>
>> -- Ken
>>
>>> Ken Krugler wrote:
>>>
>>>>> On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Several people reported issues with the slow fetcher in 0.8...
>>>>>>
>>>>>> I run Nutch on a dual CPU (+HT) box, and have noticed that the
>>>>>> fetch speed didn't increase when I went from using 100 threads
>>>>>> to 200 threads. Has anyone else observed the same?
>>>>>>
>>>>>> I was using 2 map tasks (mapred.map.tasks property) in both
>>>>>> cases, and the aggregate fetch speed was between 20 and 40
>>>>>> pages/sec. This was a fetch of 50K+ URLs from a diverse set of
>>>>>> servers.
>>>>>>
>>>>>> While crawling, strace -p<PID> and strace -ff<PID> show a LOT of
>>>>>> gettimeofday calls. Running strace several times in a row kept
>>>>>> showing that gettimeofday is the most frequent system call.
>>>>>> Has anyone tried tracing the fetcher process? Where do these
>>>>>> calls come from? Any call to new Date() or Calendar.getInstance(),
>>>>>> as must be done for every single logging call, perhaps?
>>>>>>
>>>>>> I can certainly be impolite and lower fetcher.server.delay to 1
>>>>>> second or even 0, but I'd like to be polite.
>>>>>>
>>>>>> I saw Ken Krugler's email suggesting to increase the number of
>>>>>> fetcher threads to 2000+ and set the maximum java thread stack
>>>>>> size to 512k with -Xss. Has anyone other than Ken tried this
>>>>>> with success? Wouldn't the JVM go crazy context switching
>>>>>> between this many threads?
>>>>
>>>> Note that most of the time these fetcher threads are all blocked,
>>>> waiting for other threads that are already fetching from the same
>>>> IP address. So there's not a lot of thrashing.
>>>>
>>>>> I have been working with 512k -Xss (Ken Krugler's suggestion) and it
>>>>> works well.
>>>>> However, the number of fetchers in my case is 2500+. I had to play
>>>>> around with this number to match my bandwidth limitation, but now I
>>>>> max out my full bandwidth. The problems I run into are fetcher
>>>>> threads hanging, and the crawl delay/robots.txt issue (please see
>>>>> Dennis Kubes' posting on this).
>>>>
>>>> Yes, these are definitely problems.
>>>>
>>>> Stefan has been working on a queue-based fetcher that uses NIO.
>>>> Seems very promising, but not yet ready for prime time.
>>>>
>>>> -- Ken
>>>
>>>
>>> Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
>>> ===================================================================
>>> --- src/java/org/apache/nutch/protocol/ProtocolStatus.java  (revision 428457)
>>> +++ src/java/org/apache/nutch/protocol/ProtocolStatus.java  (working copy)
>>> @@ -60,6 +60,10 @@
>>>    public static final int NOTFETCHING = 20;
>>>    /** Unchanged since the last fetch. */
>>>    public static final int NOTMODIFIED = 21;
>>> +  /** Request was refused by protocol plugins, because it would block.
>>> +   * The expected number of milliseconds to wait before retry may be provided
>>> +   * in args. */
>>> +  public static final int WOULDBLOCK = 22;
>>>
>>>    // Useful static instances for status codes that don't usually require any
>>>    // additional arguments.
>>> Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>>> ===================================================================
>>> --- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java  (revision 430392)
>>> +++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java  (working copy)
>>> @@ -125,6 +125,9 @@
>>>
>>>    /** Do we use HTTP/1.1? */
>>>    protected boolean useHttp11 = false;
>>> +
>>> +  /** Ignore page if crawl delay over this number of seconds */
>>> +  private int maxCrawlDelay = -1;
>>>
>>>    /** Creates a new instance of HttpBase */
>>>    public HttpBase() {
>>> @@ -152,6 +155,7 @@
>>>      this.userAgent = getAgentString(conf.get("http.agent.name"), conf.get("http.agent.version"), conf
>>>          .get("http.agent.description"), conf.get("http.agent.url"), conf.get("http.agent.email"));
>>>      this.serverDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
>>> +    this.maxCrawlDelay = (int) conf.getInt("http.max.crawl.delay", -1);
>>>      // backward-compatible default setting
>>>      this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
>>>      this.useHttp11 = conf.getBoolean("http.http11", false);
>>> @@ -185,6 +189,17 @@
>>>        long crawlDelay = robots.getCrawlDelay(this, u);
>>>        long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
>>> +
>>> +      int crawlDelaySeconds = (int) (delay / 1000);
>>> +      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
>>> +        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: max=" +
>>> +          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
>>> +          " seconds");
>>> +        Content c = new Content(u.toString(), u.toString(), EMPTY_CONTENT,
>>> +          null, null, this.conf);
>>> +        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.WOULDBLOCK));
>>> +      }
>>> +
>>>        String host = blockAddr(u, delay);
>>>        Response response;
>>>        try {
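One note on using the patch above: whatever consumes the ProtocolOutput also has to decide what to do with the new WOULDBLOCK code, otherwise the skipped pages just look like generic failures. Below is a minimal sketch of that handling, assuming the usual ProtocolOutput.getStatus()/ProtocolStatus.getCode() accessors; the markIgnored() helper is purely hypothetical bookkeeping, not part of Nutch or of the patch.

// Sketch only: how a fetch loop might react to the new WOULDBLOCK status.
// ProtocolOutput/ProtocolStatus are the real Nutch classes; markIgnored()
// is a hypothetical placeholder for whatever the caller wants to record.
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.ProtocolStatus;

public class WouldBlockHandlingSketch {

  void handle(ProtocolOutput output, String url) {
    ProtocolStatus status = output.getStatus();
    switch (status.getCode()) {
      case ProtocolStatus.SUCCESS:
        // Normal path: parse/store output.getContent() as usual.
        break;
      case ProtocolStatus.WOULDBLOCK:
        // Crawl delay exceeded http.max.crawl.delay; skip without retrying now.
        markIgnored(url);
        break;
      default:
        // Other failure codes keep their existing handling.
        break;
    }
  }

  private void markIgnored(String url) {
    System.out.println("Skipping " + url + " (crawl delay too large)");
  }
}

Such urls could instead be written back to the crawl db with a retry interval, but that is a design choice outside the scope of the patch.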
