[
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche resolved NUTCH-385.
---------------------------------
Resolution: Not a Problem
This is not a problem but a discussion of how things work in the Fetcher. No
action needed.
> Server delay feature conflicts with maxThreadsPerHost
> -----------------------------------------------------
>
> Key: NUTCH-385
> URL: https://issues.apache.org/jira/browse/NUTCH-385
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Reporter: Chris Schneider
>
> For some time I've been puzzled by the interaction between two parameters that
> control how often the fetcher can access a particular host:
> 1) The server delay, which comes back from the remote server during our
> processing of the robots.txt file, and which can be limited by
> fetcher.max.crawl.delay.
> 2) The fetcher.threads.per.host value, particularly when this is greater than
> the default of 1.
> According to my (limited) understanding of the code in HttpBase.java:
> Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher
> ends up keeping either 1 or 2 fetcher threads pointing at a particular host
> continuously. In other words, it never tries to point 3 at the host, and it
> always points a second thread at the host before the first thread finishes
> accessing it. Since HttpBase.unblockAddr never gets called with
> (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the
> host. Thus, the server delay will never be used at all. The fetcher will be
> continuously retrieving pages from the host, often with 2 fetchers accessing
> the host simultaneously.
> Suppose instead that the fetcher finally does allow the last thread to
> complete before it gets around to pointing another thread at the target host.
> When the last fetcher thread calls HttpBase.unblockAddr, it will now put
> System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the
> host. This, in turn, will prevent any threads from accessing this host until
> the delay is complete, even though zero threads are currently accessing the
> host.
> I see this behavior as inconsistent. More importantly, the current
> implementation certainly doesn't seem to answer my original question about
> appropriate definitions for what appear to be conflicting parameters.
> In a nutshell, how could we possibly honor the server delay if we allow more
> than one fetcher thread to simultaneously access the host?
> It would be one thing if whenever (fetcher.threads.per.host > 1), this
> trumped the server delay, causing the latter to be ignored completely. That
> is certainly not the case in the current implementation, as it will wait for
> server delay whenever the number of threads accessing a given host drops to
> zero.
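
The two scenarios described above can be reproduced with a small model. This
is only a sketch of the behavior as the report describes it, not the actual
HttpBase code: the class HostBlockModel, its method names (tryBlock, unblock,
isDelayed) and the explicit "now" parameter are hypothetical and exist only to
make the example deterministic. The two maps stand in for
THREADS_PER_HOST_COUNT and BLOCKED_ADDR_TO_TIME.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified, single-threaded model of the per-host blocking logic described
 * in the report. Hypothetical names; not the real HttpBase implementation.
 */
final class HostBlockModel {
    // Stand-in for THREADS_PER_HOST_COUNT: active fetcher threads per host.
    private final Map<String, Integer> threadsPerHostCount = new HashMap<>();
    // Stand-in for BLOCKED_ADDR_TO_TIME: earliest time the host may be hit again.
    private final Map<String, Long> blockedAddrToTime = new HashMap<>();
    private final int maxThreadsPerHost; // models fetcher.threads.per.host
    private final long crawlDelayMs;     // models the server (crawl) delay

    HostBlockModel(int maxThreadsPerHost, long crawlDelayMs) {
        this.maxThreadsPerHost = maxThreadsPerHost;
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Returns true if a thread is allowed to start fetching from host at time now. */
    boolean tryBlock(String host, long now) {
        if (isDelayed(host, now)) {
            return false; // the delay is enforced even with zero active threads
        }
        blockedAddrToTime.remove(host); // any recorded delay has expired
        int count = threadsPerHostCount.getOrDefault(host, 0);
        if (count >= maxThreadsPerHost) {
            return false; // per-host thread limit reached
        }
        threadsPerHostCount.put(host, count + 1);
        return true;
    }

    /** Called when a thread finishes with host at time now. */
    void unblock(String host, long now) {
        int count = threadsPerHostCount.getOrDefault(host, 0);
        if (count == 1) {
            // Only the LAST thread to leave records the delay.
            threadsPerHostCount.remove(host);
            blockedAddrToTime.put(host, now + crawlDelayMs);
        } else if (count > 1) {
            // Other threads are still active: no delay is recorded at all.
            threadsPerHostCount.put(host, count - 1);
        }
    }

    /** True while a recorded delay for host has not yet expired. */
    boolean isDelayed(String host, long now) {
        Long until = blockedAddrToTime.get(host);
        return until != null && now < until;
    }

    public static void main(String[] args) {
        HostBlockModel m = new HostBlockModel(2, 5000L);
        String host = "example.org";
        // Scenario 1: one or two threads always overlap, so no delay is recorded.
        m.tryBlock(host, 0L);
        m.tryBlock(host, 100L);
        m.unblock(host, 200L);
        System.out.println("delayed while overlapping: " + m.isDelayed(host, 200L));
        // Scenario 2: the last thread leaves, so the host is blocked for the delay.
        m.unblock(host, 400L);
        System.out.println("delayed after last thread: " + m.isDelayed(host, 500L));
    }
}
```

In this model the delay is applied only when the active count for a host drops
from 1 to 0, which matches the inconsistency raised in the report: overlapping
threads never trigger the delay, while an idle host is held back for the full
delay.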