[
https://issues.apache.org/jira/browse/NUTCH-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008455#comment-18008455
]
Sebastian Nagel commented on NUTCH-3120:
----------------------------------------
Hi [~markus17],
thanks! Yes, slowing down on an HTTP 429 is necessary (or at least highly
recommended), and this is already implemented in master:
- HTTP 429 is mapped to {{ProtocolStatus.EXCEPTION}} and triggers the
exponential backoff (NUTCH-2946). See also
- [FetchItemQueues|https://github.com/apache/nutch/blob/2786b5a9baefc4803964a4c5197cfbfe32f8988e/src/java/org/apache/nutch/fetcher/FetchItemQueues.java#L334]
for the implementation
- configuration of the backoff in
[nutch-default.xml|https://github.com/apache/nutch/blob/2786b5a9baefc4803964a4c5197cfbfe32f8988e/conf/nutch-default.xml#L1171]
- NUTCH-2573 if an HTTP 429 is already seen when fetching the robots.txt
- NUTCH-3067, NUTCH-3072, NUTCH-2992, NUTCH-2947 and NUTCH-2767 which fix bugs
and improve the performance when handling throttled queues, especially if there
are many of them
- NUTCH-3114 to keep fetching from stalling when only URLs from
throttled queues are left
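
To illustrate the idea behind the exponential backoff mentioned above, here is a minimal sketch. It is not the code from FetchItemQueues: the class {{ThrottledQueue}}, its fields, and the delay values are hypothetical and only assume that each consecutive HTTP 429 doubles the delay of the affected queue up to a cap, and that a successful fetch resets it. The regular crawl delay is left out for brevity.

```java
/** Hypothetical per-host queue state, for illustration only (not a Nutch class). */
final class ThrottledQueue {
    private final long baseDelayMs;   // assumed delay after the first HTTP 429
    private final long maxDelayMs;    // assumed upper bound for the backoff
    private int consecutiveExceptions = 0;
    private long nextFetchTimeMs = 0;

    ThrottledQueue(long baseDelayMs, long maxDelayMs) {
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Backoff for the current failure count: base * 2^(n-1), capped at maxDelayMs. */
    long currentDelayMs() {
        if (consecutiveExceptions == 0) {
            return 0;
        }
        int shift = Math.min(consecutiveExceptions - 1, 30);
        // Compare against the shifted cap first so the left shift cannot overflow.
        if (baseDelayMs > (maxDelayMs >> shift)) {
            return maxDelayMs;
        }
        return Math.min(maxDelayMs, baseDelayMs << shift);
    }

    /** Record a response: a 429 lengthens the backoff, anything else resets it. */
    void onResponse(int httpStatus, long nowMs) {
        if (httpStatus == 429) {
            consecutiveExceptions++;
        } else {
            consecutiveExceptions = 0;
        }
        nextFetchTimeMs = nowMs + currentDelayMs();
    }

    /** Earliest time (in ms) at which the next URL of this queue may be fetched. */
    long nextFetchTimeMs() {
        return nextFetchTimeMs;
    }
}
```

With a base delay of 1000 ms and a cap of 8000 ms, consecutive 429 responses would give delays of 1000, 2000, 4000 and 8000 ms, after which the delay stays at the cap until a fetch succeeds.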
> Automatically increase crawl-delay on HTTP 429
> ----------------------------------------------
>
> Key: NUTCH-3120
> URL: https://issues.apache.org/jira/browse/NUTCH-3120
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.22
>
> Attachments: NUTCH-3120-1.15.patch
>
>
> I thought I remembered a discussion or ticket on this subject, but it seems no
> code of this sort is in master at the moment.
> Anyway, small patch that adds HTTP429 to ProtocolStatus, the setter for that
> status to HttpBase, and the reading of that status in FetcherThread, so it
> can adjust the fetch speed (hardcoded to *3 right now).
>
>