[
https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477361#comment-17477361
]
ASF GitHub Bot commented on NUTCH-2573:
---------------------------------------
sebastian-nagel commented on pull request #724:
URL: https://github.com/apache/nutch/pull/724#issuecomment-1014808017
Hi @lewismc, done: updated metrics wiki page (hitByTimeLimit is already
documented), added Javadocs and renamed the counter to follow the naming
convention of the other robots_* counters. Also renamed the method
("timelimitReached" -> "timelimitExceeded").
> Suspend crawling if robots.txt fails to fetch with 5xx status
> -------------------------------------------------------------
>
> Key: NUTCH-2573
> URL: https://issues.apache.org/jira/browse/NUTCH-2573
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.19
>
>
> Fetcher should optionally (by default) suspend crawling for a configurable
> interval when fetching the robots.txt fails with a server error (HTTP status
> code 5xx, esp. 503), following [Google's spec|
> https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow"
> of crawling. The request is retried until a non-server-error HTTP result code
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent
> retrying. To temporarily suspend crawling, it is recommended to serve a 503
> HTTP result code. Handling of a permanent server error is undefined.??
> See also the [draft robots.txt RFC, section "Unreachable
> status"|https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.3.1.4].
> Crawler-commons robots rules already provide
> [isDeferVisits|https://crawler-commons.github.io/crawler-commons/1.2/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
> to store this information (must be set from RobotRulesParser).
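
The behaviour requested in the description can be sketched roughly as follows. This is a minimal, hypothetical illustration and not the Nutch implementation: the class name, method names, and the five-minute delay are assumptions made for the example, and the sketch avoids the crawler-commons API so that it stays self-contained.

```java
import java.time.Duration;

/** Hypothetical sketch, not Nutch code: defer visits to a host after a robots.txt 5xx. */
final class RobotsDeferSketch {

  /** Assumed delay; the issue only says the interval is configurable. */
  static final Duration ASSUMED_DEFER_DELAY = Duration.ofMinutes(5);

  /** A 5xx response for robots.txt is treated as a temporary "full disallow". */
  static boolean shouldDeferVisits(int robotsTxtHttpStatus) {
    return robotsTxtHttpStatus >= 500 && robotsTxtHttpStatus <= 599;
  }

  /** Returns the earliest time (epoch millis) at which the host may be fetched again. */
  static long nextFetchTime(int robotsTxtHttpStatus, long nowMillis, Duration deferDelay) {
    if (shouldDeferVisits(robotsTxtHttpStatus)) {
      // Server error: suspend crawling of this host for the configured interval.
      return nowMillis + deferDelay.toMillis();
    }
    // Any other status: no suspension is applied by this sketch.
    return nowMillis;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    for (int status : new int[] {200, 404, 500, 503}) {
      long waitMs = nextFetchTime(status, now, ASSUMED_DEFER_DELAY) - now;
      System.out.println(status + " -> wait " + waitMs + " ms");
    }
  }
}
```

In a real fetcher the decision would presumably be read from the parsed robots rules (the isDeferVisits flag mentioned above) rather than recomputed from the status code, and the request would be retried after the interval elapses.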