Sebastian Nagel created NUTCH-3114:
--------------------------------------

             Summary: Avoid stale fetching when only URLs from queues blocked 
by the exponential backoff remain 
                 Key: NUTCH-3114
                 URL: https://issues.apache.org/jira/browse/NUTCH-3114
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.19
            Reporter: Sebastian Nagel
            Assignee: Sebastian Nagel
             Fix For: 1.21


The exponential backoff (NUTCH-2946) politely slows down fetching from queues 
where requests fail repeatedly with exceptions or HTTP status codes (503, 403, 
429, etc.) mapped to the protocol status "EXCEPTION".

However, the delay grows exponentially: starting with the default fetch 
delay of 5 seconds, the fetcher waits for five minutes after the 8th exception. 
If all "good" queues are exhausted and neither a time limit 
({{fetcher.timelimit.mins}}) nor a minimum throughput 
({{fetcher.throughput.threshold.pages}}) is configured, fetching may stall 
until it is finally stopped by the task timeout.

The default for {{fetcher.max.exceptions.per.queue}} should be set to a 
reasonably low value, so that queues where requests fail repeatedly with 
exceptions are purged. With the current default of {{-1}}, queues are never 
purged.
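
Until the default changes, a site-level override can serve as a workaround. 
The following is only a sketch of a {{nutch-site.xml}} fragment using the three 
properties named above. All values are illustrative assumptions, not the 
defaults proposed for the fix, and should be tuned per crawl.

{code:xml}
<!-- Sketch only: all values below are illustrative assumptions. -->
<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <!-- Assumed value: purge a queue after 10 exceptions (-1 never purges). -->
  <value>10</value>
</property>
<property>
  <name>fetcher.timelimit.mins</name>
  <!-- Assumed value: stop fetching after 180 minutes. -->
  <value>180</value>
</property>
<property>
  <name>fetcher.throughput.threshold.pages</name>
  <!-- Assumed value: minimum acceptable throughput before the fetcher stops. -->
  <value>1</value>
</property>
{code}

Setting any one of the three should be enough to keep a fetch task from 
waiting on backed-off queues until the task timeout.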



--
This message was sent by Atlassian Jira
(v8.20.10#820010)