[
https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477667#comment-17477667
]
Hudson commented on NUTCH-2573:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #71 (See
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/71/])
NUTCH-2573 Suspend crawling if robots.txt fails to fetch with 5xx status (#724)
(github:
[https://github.com/apache/nutch/commit/f691baebc3c04c08ea500f4767e2decb88c30c70])
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueues.java
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java
* (edit) src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
> Suspend crawling if robots.txt fails to fetch with 5xx status
> -------------------------------------------------------------
>
> Key: NUTCH-2573
> URL: https://issues.apache.org/jira/browse/NUTCH-2573
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.19
>
>
> Fetcher should optionally (by default) suspend crawling for a configurable
> interval when fetching the robots.txt fails with a server error (HTTP status
> code 5xx, esp. 503) following [Google's spec|
> https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow"
> of crawling. The request is retried until a non-server-error HTTP result code
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent
> retrying. To temporarily suspend crawling, it is recommended to serve a 503
> HTTP result code. Handling of a permanent server error is undefined.??
> See also the [draft robots.txt RFC, section "Unreachable
> status"|https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.3.1.4].
> Crawler-commons robots rules already provide
> [isDeferVisits|https://crawler-commons.github.io/crawler-commons/1.2/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
> to store this information (must be set from RobotRulesParser).
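
The behaviour described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the commit: the class RobotsDeferSketch, the Action enum, the constants DEFER_DELAY and MAX_RETRIES and their values are all hypothetical. The retry cap in particular is an assumption, since the quoted spec leaves the handling of a permanent server error undefined.

```java
import java.time.Duration;

/** Illustrative sketch only; none of these names are Nutch or crawler-commons API. */
class RobotsDeferSketch {

    /** What to do with a host's fetch queue after requesting its robots.txt. */
    enum Action { PROCEED, DEFER, GIVE_UP }

    // Hypothetical values; a real crawler would read them from its configuration.
    static final Duration DEFER_DELAY = Duration.ofMinutes(5);
    static final int MAX_RETRIES = 3;

    /** Server errors (HTTP 5xx) are treated as temporary failures. */
    static boolean isServerError(int httpStatus) {
        return httpStatus >= 500 && httpStatus <= 599;
    }

    /**
     * Decide how to continue after a robots.txt response.
     * Non-5xx responses are simplified to PROCEED here; a real parser would
     * still apply the parsed rules (or the usual handling of 4xx responses).
     */
    static Action decide(int robotsHttpStatus, int retriesSoFar) {
        if (!isServerError(robotsHttpStatus)) {
            return Action.PROCEED;
        }
        // Assumption: stop retrying after MAX_RETRIES deferrals.
        return retriesSoFar < MAX_RETRIES ? Action.DEFER : Action.GIVE_UP;
    }

    /** Earliest time (epoch millis) at which the host may be contacted again. */
    static long nextFetchTime(long nowMillis, Action action) {
        return action == Action.DEFER ? nowMillis + DEFER_DELAY.toMillis() : nowMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        Action action = decide(503, 0);
        System.out.println("robots.txt returned 503: " + action
            + ", wait " + (nextFetchTime(now, action) - now) + " ms");
    }
}
```

In this sketch the deferral is tracked per host, so a 503 on one site's robots.txt suspends only that site's queue while the rest of the crawl continues.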