[ 
https://issues.apache.org/jira/browse/NUTCH-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531662#comment-17531662
 ] 

Sebastian Nagel commented on NUTCH-2946:
----------------------------------------

Hi [~markus17], thanks for your remarks! An [exponential 
backoff|https://en.wikipedia.org/wiki/Exponential_backoff] is definitely the 
best way to handle such failed fetches. I've updated the PR: the delay is now 
doubled with every observed protocol exception. Here are the fetcher log 
snippets from a test run:
{noformat}
2022-05-04 12:47:20,054 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 53 fetch of http://localhost/nutch/test-exception/10.html failed 
with: Http code=429, url=http://localhost/nutch/test-exception/10.html
2022-05-04 12:47:20,054 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
localhost >> delayed next fetch by 1000 ms after 1 exceptions in queue
--
2022-05-04 12:47:22,069 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 53 fetch of http://localhost/nutch/test-exception/3.html failed 
with: Http code=429, url=http://localhost/nutch/test-exception/3.html
2022-05-04 12:47:22,070 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
localhost >> delayed next fetch by 2000 ms after 2 exceptions in queue
--
2022-05-04 12:47:25,092 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 53 fetch of http://localhost/nutch/test-exception/6.html failed 
with: Http code=429, url=http://localhost/nutch/test-exception/6.html
2022-05-04 12:47:25,093 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
localhost >> delayed next fetch by 4000 ms after 3 exceptions in queue
--
2022-05-04 12:47:30,116 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 53 fetch of http://localhost/nutch/test-exception/9.html failed 
with: Http code=429, url=http://localhost/nutch/test-exception/9.html
2022-05-04 12:47:30,116 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
localhost >> delayed next fetch by 8000 ms after 4 exceptions in queue
--
2022-05-04 12:47:39,128 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 53 fetch of http://localhost/nutch/test-exception/1.html failed 
with: Http code=429, url=http://localhost/nutch/test-exception/1.html
2022-05-04 12:47:39,128 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
localhost >> delayed next fetch by 16000 ms after 5 exceptions in queue
--
2022-05-04 12:47:56,144 INFO o.a.n.f.FetcherThread [FetcherThread] 
FetcherThread 53 fetch of http://localhost/nutch/test-exception/4.html failed 
with: Http code=429, url=http://localhost/nutch/test-exception/4.html
2022-05-04 12:47:56,145 INFO o.a.n.f.FetchItemQueues [FetcherThread] * queue: 
localhost >> delayed next fetch by 32000 ms after 6 exceptions in queue
{noformat}
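
To make the doubling in the log above easier to follow, here is a minimal Java sketch of the arithmetic. It is not the code from the PR: the class {{BackoffSketch}}, the method {{delayMs}}, the 1000 ms base delay (read off the first log line) and the upper cap {{maxDelayMs}} are all assumptions made for illustration only.

```java
// Illustrative sketch only: names, base delay and cap are assumptions,
// not the actual Nutch implementation.
final class BackoffSketch {

    private BackoffSketch() {
    }

    /**
     * Returns the delay before the next fetch from a queue:
     * baseDelayMs * 2^(exceptions - 1), limited to maxDelayMs.
     * With fewer than one exception the base delay is returned unchanged.
     */
    static long delayMs(long baseDelayMs, int exceptions, long maxDelayMs) {
        long delay = baseDelayMs;
        for (int i = 1; i < exceptions; i++) {
            // Stop doubling once the next step would exceed the cap
            // (this check also avoids arithmetic overflow).
            if (delay > maxDelayMs / 2) {
                return maxDelayMs;
            }
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    public static void main(String[] args) {
        // Reproduces the sequence from the log: 1000, 2000, 4000, ... 32000 ms
        for (int n = 1; n <= 6; n++) {
            System.out.println("delayed next fetch by "
                + delayMs(1000L, n, 60_000L) + " ms after " + n
                + " exceptions in queue");
        }
    }
}
```

With a base of 1000 ms this yields 1000, 2000, 4000, 8000, 16000 and 32000 ms for 1 to 6 exceptions, matching the log. The cap is only there to keep the sketch from growing without bound; whether and how the delay is limited in the PR is not shown here.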

For testing, I've put the following .htaccess into a folder under my local 
Apache server root:
{noformat}
<ifModule mod_rewrite.c>
RewriteEngine On
Redirect 429 /
</ifModule>
{noformat}

> Fetcher: optionally slow down fetching from hosts with repeated exceptions
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-2946
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2946
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> For every fetch queue, the fetcher keeps a counter of the "exceptions" 
> observed when fetching from the host (resp. domain or IP) bound to this 
> queue.
> As an improvement to increase the politeness of the crawler, the counter 
> value could be used to dynamically increase the fetch delay for hosts where 
> requests fail repeatedly with exceptions or HTTP status codes mapped to 
> ProtocolStatus.EXCEPTION (HTTP 403 Forbidden, 429 Too Many Requests, 5xx 
> server errors, etc.). Of course, this should be optional. The aim is to 
> reduce the load on such hosts even before the configured max. number of 
> exceptions (property fetcher.max.exceptions.per.queue) is hit.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
