[ https://issues.apache.org/jira/browse/NUTCH-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008455#comment-18008455 ]
Sebastian Nagel commented on NUTCH-3120:
----------------------------------------

Hi [~markus17],

thanks! Yes, slowing down on an HTTP 429 is necessary (or at least highly recommended), and this is already implemented in master:
- HTTP 429 is mapped to {{ProtocolStatus.EXCEPTION}} and triggers the exponential backoff (NUTCH-2946).

See also
- [FetchItemQueues|https://github.com/apache/nutch/blob/2786b5a9baefc4803964a4c5197cfbfe32f8988e/src/java/org/apache/nutch/fetcher/FetchItemQueues.java#L334] for the implementation
- configuration of the backoff in [nutch-default.xml|https://github.com/apache/nutch/blob/2786b5a9baefc4803964a4c5197cfbfe32f8988e/conf/nutch-default.xml#L1171]
- NUTCH-2573 if an HTTP 429 is already seen when fetching the robots.txt
- NUTCH-3067, NUTCH-3072, NUTCH-2992, NUTCH-2947 and NUTCH-2767, which fix bugs and improve performance when handling throttled queues, especially if there are many of them
- NUTCH-3114 to keep fetching from stalling when only URLs from throttled queues are left

> Automatically increase crawl-delay on HTTP 429
> ----------------------------------------------
>
>                 Key: NUTCH-3120
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3120
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Major
>             Fix For: 1.22
>
>         Attachments: NUTCH-3120-1.15.patch
>
>
> Thought I remembered a discussion or ticket on this subject, but it seems no
> code of this sort is in master at the moment.
> Anyway, small patch that adds HTTP429 to ProtocolStatus, the setter for that
> status to HttpBase, and the reading of that status in FetcherThread, so it
> can adjust the fetch speed (hardcoded to *3 right now).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
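
For readers unfamiliar with the idea, the exponential backoff mentioned in the comment can be illustrated with a minimal, self-contained sketch. This is not the Nutch implementation: the class {{ThrottledQueueSketch}}, its method names, the doubling factor and the upper bound are all assumptions made for illustration. The actual behavior is in the linked FetchItemQueues code and is configured in nutch-default.xml.

```java
/**
 * Hypothetical per-host queue state (NOT the actual Nutch FetchItemQueue).
 * Each consecutive throttled response (e.g. HTTP 429) doubles the delay
 * before the next fetch, up to an assumed maximum.
 */
final class ThrottledQueueSketch {
    private final long baseDelayMs;      // regular crawl delay for this host
    private final long maxDelayMs;       // assumed cap on the backoff delay
    private int consecutiveFailures = 0; // throttled responses in a row

    ThrottledQueueSketch(long baseDelayMs, long maxDelayMs) {
        if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("need 0 < baseDelayMs <= maxDelayMs");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Record a throttled response and return the delay now in effect. */
    long onThrottled() {
        consecutiveFailures++;
        return currentDelayMs();
    }

    /** A successful fetch resets the backoff to the regular delay. */
    void onSuccess() {
        consecutiveFailures = 0;
    }

    /** Delay = baseDelayMs * 2^failures, capped at maxDelayMs. */
    long currentDelayMs() {
        int shift = Math.min(consecutiveFailures, 62);
        // Compare before shifting so the multiplication cannot overflow.
        if (baseDelayMs > (maxDelayMs >> shift)) {
            return maxDelayMs;
        }
        return baseDelayMs << shift;
    }
}
```

With a base delay of 1000 ms and a cap of 60000 ms, successive throttled responses in this sketch give delays of 2000 ms, 4000 ms, 8000 ms and so on until the cap is reached. A successful fetch returns the queue to 1000 ms. Compared with a fixed multiplier applied once, this slows down progressively while the server keeps answering with 429 and recovers once it stops.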