Sebastian Nagel created NUTCH-3114:
--------------------------------------
Summary: Avoid stale fetching when only URLs from queues blocked
by the exponential backoff remain
Key: NUTCH-3114
URL: https://issues.apache.org/jira/browse/NUTCH-3114
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.19
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.21
The exponential backoff (NUTCH-2946) politely slows down fetching from queues
where requests fail repeatedly with exceptions or HTTP status codes (503, 403,
429, etc.) mapped to the protocol status "EXCEPTION".
However, the delay grows exponentially: starting with the default fetch
delay of 5 seconds, the fetcher waits for five minutes after the 8th exception.
If all "good" queues are exhausted and neither a time limit
({{fetcher.timelimit.mins}}) nor a minimum throughput
({{fetcher.throughput.threshold.pages}}) is configured, fetching may
stall until it is finally stopped by the task timeout.
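
The growth of the delay can be illustrated with a rough sketch. It assumes
that the delay doubles with every consecutive exception and that the initial
backoff delay is half of the 5-second fetch delay (2.5 seconds). Both the
formula and the initial value are assumptions made for illustration, and the
function name is hypothetical, not part of Nutch:

```python
def backoff_delay_seconds(initial_delay: float, exceptions: int,
                          max_exponent: int = 31) -> float:
    """Delay applied after the n-th consecutive exception.

    Assumption (not taken from the Nutch code): the delay doubles with
    each consecutive exception, starting at `initial_delay`.
    """
    if exceptions < 1:
        return 0.0
    # Cap the exponent so the delay cannot grow without bound.
    return initial_delay * 2 ** min(exceptions - 1, max_exponent)


if __name__ == "__main__":
    # Hypothetical initial delay: half of the default 5-second fetch delay.
    for n in range(1, 9):
        print(f"exception {n}: wait {backoff_delay_seconds(2.5, n):.1f} s")
```

Under these assumptions the wait after the 8th exception is 320 seconds,
roughly five minutes, and each further exception doubles it again.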
The default for {{fetcher.max.exceptions.per.queue}} should be set to a
reasonably low value, so that queues where requests fail repeatedly with
exceptions are purged. With the current default of {{-1}}, queues are never
purged.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)