[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12444162 ]

Ken Krugler commented on NUTCH-385:
-----------------------------------
There is a middle ground, though we don't know yet how important it is to address. When we crawl partner sites, we sometimes have the OK to crawl faster than 1 thread/host with zero delay. But we do still need to worry about the total load that we put on their servers. So this is an example of a "crawling quickly" case where we don't control the site, and there is a need to be polite - but the definition of politeness is variable. Typically we get information about good times on certain days to crawl, when the partner ops group knows that they traditionally have low loads.

What we don't yet know is whether we need better granularity than just N threads per host, where N > 1, assuming a zero delay between requests. With one thread and the crawl delay, you can gradually crank down the politeness level. With N threads and zero crawl delay, you get a 2x rate increase going from 1 to 2 threads. But ours is an unusual case, so I'd be OK with ignoring the crawl delay (with a warning if it's > 0) when threads per host > 1.

> Server delay feature conflicts with maxThreadsPerHost
> -----------------------------------------------------
>
>                 Key: NUTCH-385
>                 URL: http://issues.apache.org/jira/browse/NUTCH-385
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Chris Schneider
>
> For some time I've been puzzled by the interaction between two parameters that control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our processing of the robots.txt file, and which can be limited by fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher ends up keeping either 1 or 2 fetcher threads pointing at a particular host continuously.
> In other words, it never tries to point 3 at the host, and it always points a second thread at the host before the first thread finishes accessing it. Since HttpBase.unblockAddr never gets called with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. Thus, the server delay will never be used at all. The fetcher will be continuously retrieving pages from the host, often with 2 fetchers accessing the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to complete before it gets around to pointing another thread at the target host. When the last fetcher thread calls HttpBase.unblockAddr, it will now put System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. This, in turn, will prevent any threads from accessing this host until the delay is complete, even though zero threads are currently accessing the host.
> I see this behavior as inconsistent. More importantly, the current implementation certainly doesn't seem to answer my original question about appropriate definitions for what appear to be conflicting parameters.
> In a nutshell, how could we possibly honor the server delay if we allow more than one fetcher thread to simultaneously access the host?
> It would be one thing if whenever (fetcher.threads.per.host > 1), this trumped the server delay, causing the latter to be ignored completely. That is certainly not the case in the current implementation, as it will wait for the server delay whenever the number of threads accessing a given host drops to zero.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
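The inconsistency described in the report can be sketched as follows. This is a simplified reconstruction, not the actual HttpBase code; the class name HostBlocking and the method tryBlockAddr are invented for illustration, while the two maps mirror the THREADS_PER_HOST_COUNT and BLOCKED_ADDR_TO_TIME fields named above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the per-host accounting described in the issue.
public class HostBlocking {
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final long crawlDelay;

    public HostBlocking(long crawlDelayMillis) {
        this.crawlDelay = crawlDelayMillis;
    }

    // Called before a thread fetches from a host; returns false if the
    // host is still inside its server-delay window or already has
    // maxThreadsPerHost threads on it.
    public synchronized boolean tryBlockAddr(String host, int maxThreadsPerHost) {
        Long until = blockedAddrToTime.get(host);
        if (until != null) {
            if (System.currentTimeMillis() < until) {
                return false; // still waiting out the server delay
            }
            blockedAddrToTime.remove(host);
        }
        int count = threadsPerHostCount.getOrDefault(host, 0);
        if (count >= maxThreadsPerHost) {
            return false; // host already at its thread limit
        }
        threadsPerHostCount.put(host, count + 1);
        return true;
    }

    // Called when a thread finishes fetching from a host. The server
    // delay is only scheduled when the LAST thread leaves -- which is
    // exactly the behavior the issue calls inconsistent: overlapping
    // threads mean the delay is never applied at all, while a drop to
    // zero threads imposes the full delay.
    public synchronized void unblockAddr(String host) {
        int count = threadsPerHostCount.get(host);
        if (count == 1) {
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, System.currentTimeMillis() + crawlDelay);
        } else {
            threadsPerHostCount.put(host, count - 1);
        }
    }

    public synchronized boolean isDelayed(String host) {
        Long until = blockedAddrToTime.get(host);
        return until != null && System.currentTimeMillis() < until;
    }
}
```

With fetcher.threads.per.host = 2 and the threads always overlapping, unblockAddr sees a count of 2 on every call, so blockedAddrToTime is never populated and the crawl delay never takes effect; the moment the count happens to reach zero, the full delay blocks all threads from the host.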
