[ https://issues.apache.org/jira/browse/NUTCH-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973401#comment-16973401 ]
Sebastian Nagel commented on NUTCH-2754:
----------------------------------------
See also
[crawler-commons#276|https://github.com/crawler-commons/crawler-commons/issues/276]
and
[storm-crawler#768|https://github.com/DigitalPebble/storm-crawler/issues/768].
> fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
> --------------------------------------------------------------
>
> Key: NUTCH-2754
> URL: https://issues.apache.org/jira/browse/NUTCH-2754
> Project: Nutch
> Issue Type: Bug
> Components: fetcher, robots
> Affects Versions: 1.16
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.17
>
>
> Sites specifying a Crawl-Delay of more than 5 minutes (301 seconds or more)
> are always excluded from fetching, even if fetcher.max.crawl.delay is set to
> a higher value. We need to pass the configured fetcher.max.crawl.delay on to
> [crawler-commons' robots.txt
> parser|https://github.com/crawler-commons/crawler-commons/blob/c9c0ac6eda91b13d534e69f6da3fd15065414fb0/src/main/java/crawlercommons/robots/SimpleRobotRulesParser.java#L78];
> otherwise the parser falls back to its internal limit of 300 sec. and
> disallows all sites that specify a longer Crawl-Delay in their robots.txt.
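A minimal sketch of the intended wiring, forwarding the fetcher.max.crawl.delay property (seconds, default 30, -1 = no limit, as documented in nutch-default.xml) to the crawler-commons parser. The setter setMaxCrawlDelay(long) used here is an assumption and would only become available once crawler-commons#276 is resolved; the class and method names are illustrative, not the actual Nutch implementation:

{code:java}
// Sketch only: pass Nutch's fetcher.max.crawl.delay to crawler-commons'
// SimpleRobotRulesParser instead of relying on its hard-coded 300 sec. limit.
import org.apache.hadoop.conf.Configuration;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotRulesParserSketch {

  public static BaseRobotRules parse(Configuration conf, String url,
      byte[] robotsTxtContent, String agentNames) {
    // fetcher.max.crawl.delay is configured in seconds, -1 means "no limit"
    long maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30);

    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    // setMaxCrawlDelay(long) is an assumed setter (see crawler-commons#276);
    // the limit is passed in milliseconds to match the parser's internal unit.
    if (maxCrawlDelay > 0) {
      parser.setMaxCrawlDelay(maxCrawlDelay * 1000L);
    } else {
      // no limit configured: accept any Crawl-Delay announced in robots.txt
      parser.setMaxCrawlDelay(Long.MAX_VALUE);
    }

    return parser.parseContent(url, robotsTxtContent, "text/plain", agentNames);
  }
}
{code}

With such wiring, a site announcing e.g. Crawl-delay: 600 would still be fetched when fetcher.max.crawl.delay is raised accordingly, instead of being silently disallowed by the parser's built-in 300-second constant.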