Sorry, yeah, it was NUTCH-344, not NUTCH-334.

Dennis

Ken Krugler wrote:
>> Here is a lightly tested patch for the crawl delay that allows URLs
>> with a crawl delay greater than x seconds to be ignored. I have so
>> far run this on over 2 million URLs and it is working well. With
>> this patch I also added this to my nutch-site.xml file to ignore
>> sites with a crawl delay > 30 seconds. The value can be changed to
>> suit. I have seen crawl delays as high as 259200 seconds in our
>> crawls.
>>
>> <property>
>>   <name>http.max.crawl.delay</name>
>>   <value>30</value>
>>   <description>
>>   If the crawl delay in robots.txt is set to greater than this value then the
>>   fetcher will ignore this page. If set to -1 the fetcher will never ignore
>>   pages and will wait the amount of time retrieved from robots.txt crawl delay.
>>   This can cause hung threads if the delay is >= task timeout value. If all
>>   threads get hung it can cause the fetcher task to abort prematurely.
>>   </description>
>> </property>
>
> Thanks, this is useful. We did a survey of a bunch of sites, and found
> crawl delay values up to 99999 seconds.
>
>> The most recent patches to fetcher (not fetcher2) with NUTCH-334
>> seem to have sped up our fetching dramatically. We are only using
>> about 50 fetchers but are consistently fetching 1M+ URLs per day. The
>> attached patch and the 334 patches will help if staying on 0.8. If
>> moving forward I think the new Fetcher2 codebase is a better solution,
>> though still a new one.
>
> NUTCH-334 or NUTCH-344? I'm assuming the latter
> (http://issues.apache.org/jira/browse/NUTCH-344).
>
> Thanks,
>
> -- Ken
>
>
>> Ken Krugler wrote:
>>>> On 8/12/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>>>>> Hello,
>>>>>
>>>>> Several people reported issues with slow fetcher in 0.8...
>>>>>
>>>>> I run Nutch on a dual CPU (+HT) box, and have noticed that the
>>>>> fetch speed didn't increase when I went from using 100 threads to
>>>>> 200 threads. Has anyone else observed the same?
>>>>>
>>>>> I was using 2 map tasks (mapred.map.tasks property) in both cases,
>>>>> and the aggregate fetch speed was between 20 and 40 pages/sec.
>>>>> This was a fetch of 50K+ URLs from a diverse set of servers.
>>>>>
>>>>> While crawling, strace -p<PID> and strace -ff<PID> show a LOT of
>>>>> gettimeofday calls. Running strace several times in a row kept
>>>>> showing that gettimeofday is the most frequent system call.
>>>>> Has anyone tried tracing the fetcher process? Where do these
>>>>> calls come from? Any call to new Date() or
>>>>> Calendar.getInstance(), as must be done for every single logging
>>>>> call, perhaps?
>>>>>
>>>>> I can certainly be impolite and lower fetcher.server.delay to 1
>>>>> second or even 0, but I'd like to be polite.
>>>>>
>>>>> I saw Ken Krugler's email suggesting to increase the number of
>>>>> fetcher threads to 2000+ and set the maximal Java thread stack
>>>>> size to 512k with -Xss. Has anyone other than Ken tried this with
>>>>> success? Wouldn't the JVM go crazy context switching between this
>>>>> many threads?
>>>
>>> Note that most of the time these fetcher threads are all blocked,
>>> waiting for other threads that are already fetching from the same IP
>>> address. So there's not a lot of thrashing.
>>>
>>>> I have been working with 512k -Xss (Ken Krugler's suggestion) and it
>>>> works well. However, the number of fetchers on my side is 2500+. I
>>>> had to play around with this number to match my bandwidth limitation,
>>>> but now I use my full bandwidth. The problems that I run into are
>>>> fetcher threads hanging, and crawl delay/robots.txt files (please see
>>>> Dennis Kubes's posting on this).
>>>
>>> Yes, these are definitely problems.
>>>
>>> Stefan has been working on a queue-based fetcher that uses NIO.
>>> Seems very promising, but not yet ready for prime time.
>>>
>>> -- Ken
>>
>>
>> Index: src/java/org/apache/nutch/protocol/ProtocolStatus.java
>> ===================================================================
>> --- src/java/org/apache/nutch/protocol/ProtocolStatus.java  (revision 428457)
>> +++ src/java/org/apache/nutch/protocol/ProtocolStatus.java  (working copy)
>> @@ -60,6 +60,10 @@
>>    public static final int NOTFETCHING = 20;
>>    /** Unchanged since the last fetch. */
>>    public static final int NOTMODIFIED = 21;
>> +  /** Request was refused by protocol plugins, because it would block.
>> +   * The expected number of milliseconds to wait before retry may be provided
>> +   * in args. */
>> +  public static final int WOULDBLOCK = 22;
>>    // Useful static instances for status codes that don't usually require any
>>    // additional arguments.
>> Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>> ===================================================================
>> --- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java  (revision 430392)
>> +++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java  (working copy)
>> @@ -125,6 +125,9 @@
>>
>>    /** Do we use HTTP/1.1? */
>>    protected boolean useHttp11 = false;
>> +
>> +  /** Ignore page if crawl delay over this number of seconds */
>> +  private int maxCrawlDelay = -1;
>>
>>    /** Creates a new instance of HttpBase */
>>    public HttpBase() {
>> @@ -152,6 +155,7 @@
>>      this.userAgent = getAgentString(conf.get("http.agent.name"), conf.get("http.agent.version"), conf
>>          .get("http.agent.description"), conf.get("http.agent.url"), conf.get("http.agent.email"));
>>      this.serverDelay = (long) (conf.getFloat("fetcher.server.delay", 1.0f) * 1000);
>> +    this.maxCrawlDelay = (int) conf.getInt("http.max.crawl.delay", -1);
>>      // backward-compatible default setting
>>      this.byIP = conf.getBoolean("fetcher.threads.per.host.by.ip", true);
>>      this.useHttp11 = conf.getBoolean("http.http11", false);
>> @@ -185,6 +189,17 @@
>>        long crawlDelay = robots.getCrawlDelay(this, u);
>>        long delay = crawlDelay > 0 ? crawlDelay : serverDelay;
>> +
>> +      int crawlDelaySeconds = (int)(delay / 1000);
>> +      if (crawlDelaySeconds > maxCrawlDelay && maxCrawlDelay >= 0) {
>> +        LOGGER.info("Ignoring " + u + ", exceeds max crawl delay: max=" +
>> +          maxCrawlDelay + ", crawl delay=" + crawlDelaySeconds +
>> +          " seconds");
>> +        Content c = new Content(u.toString(), u.toString(), EMPTY_CONTENT,
>> +          null, null, this.conf);
>> +        return new ProtocolOutput(c, new ProtocolStatus(ProtocolStatus.WOULDBLOCK));
>> +      }
>> +
>>        String host = blockAddr(u, delay);
>>        Response response;
>>        try {
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
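
For anyone who wants to try the cutoff logic outside of Nutch, here is a rough standalone sketch of the decision the HttpBase hunk above makes. The class and method names (CrawlDelayPolicy, effectiveDelayMs, shouldSkip) are made up for illustration and are not part of Nutch. It assumes, as the patch does, that delays arrive in milliseconds and that the limit is in whole seconds. Unlike the patch, it keeps the seconds value as a long rather than casting to int.

```java
/**
 * Hypothetical helper (not part of Nutch) that mirrors the check added to
 * HttpBase in the patch: skip a URL when its delay exceeds a configured limit.
 */
final class CrawlDelayPolicy {
    private CrawlDelayPolicy() {}

    /** Use the robots.txt crawl delay if it is positive, else the server default (both in ms). */
    static long effectiveDelayMs(long robotsCrawlDelayMs, long serverDelayMs) {
        return robotsCrawlDelayMs > 0 ? robotsCrawlDelayMs : serverDelayMs;
    }

    /**
     * True when the delay, truncated to whole seconds, is greater than the limit.
     * A negative limit (the -1 default) disables the check entirely.
     */
    static boolean shouldSkip(long delayMs, int maxCrawlDelaySeconds) {
        if (maxCrawlDelaySeconds < 0) {
            return false;
        }
        long delaySeconds = delayMs / 1000; // integer division, as in the patch
        return delaySeconds > maxCrawlDelaySeconds;
    }

    public static void main(String[] args) {
        long serverDelayMs = 1000L;   // corresponds to fetcher.server.delay = 1.0
        int maxCrawlDelaySeconds = 30; // corresponds to http.max.crawl.delay = 30
        long[] robotsDelaysMs = {0L, 30_000L, 31_000L, 259_200_000L};
        for (long robotsDelayMs : robotsDelaysMs) {
            long delayMs = effectiveDelayMs(robotsDelayMs, serverDelayMs);
            System.out.println(delayMs + " ms, skip=" + shouldSkip(delayMs, maxCrawlDelaySeconds));
        }
    }
}
```

Note that the comparison is strict, so a site asking for exactly 30 seconds is still fetched with a limit of 30. The 259200 second value mentioned in the thread works out to three days.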
