[
https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538725#comment-13538725
]
Tejas Patil commented on NUTCH-1284:
------------------------------------
I searched for the relevant mail thread [0] to understand why this bug was
created.
Quick recap of the issue:
Although fetcher.max.crawl.delay was set to -1, Nutch was marking the URL as
ROBOTS_DENIED. With fetcher.max.crawl.delay = -1, the expected behavior is to
wait for the Crawl-Delay given in robots.txt, however long
that might be.
Lewis was able to reproduce the issue. He suggested the change mentioned in the
bug and hinted that there might be a problem with that property.
A further condition also had to be changed so that URLs are not
marked DB_GONE when fetcher.max.crawl.delay = -1 (i.e. maxCrawlDelay = -1000).
After this change, I tested with the scenario mentioned in [0] and it worked
fine.
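To illustrate the kind of check described above, here is a minimal sketch. It
is not the code from the attached patch: the class and method names
(CrawlDelayPolicy, toMillis, exceedsMaxDelay) are hypothetical, and it only
assumes what this comment states, namely that the configured seconds are
multiplied by 1000 (so -1 becomes -1000) and that a negative maximum means
"no limit".

```java
// Hypothetical sketch; these names are illustrative and are not Nutch's own.
final class CrawlDelayPolicy {

    /** Converts the configured value in seconds to milliseconds; -1 becomes -1000. */
    static long toMillis(int maxCrawlDelaySeconds) {
        return maxCrawlDelaySeconds * 1000L;
    }

    /**
     * Returns true when the site's Crawl-Delay exceeds the configured maximum,
     * i.e. when the URL would be skipped. A negative maximum disables the
     * limit, so the fetcher waits however long robots.txt asks for.
     */
    static boolean exceedsMaxDelay(long maxCrawlDelayMs, long robotsCrawlDelayMs) {
        return maxCrawlDelayMs >= 0 && robotsCrawlDelayMs > maxCrawlDelayMs;
    }

    public static void main(String[] args) {
        long unlimited = toMillis(-1);      // -1000: no upper bound
        long thirtySeconds = toMillis(30);  // 30000

        System.out.println(exceedsMaxDelay(unlimited, 120_000L));     // false
        System.out.println(exceedsMaxDelay(thirtySeconds, 120_000L)); // true
        System.out.println(exceedsMaxDelay(thirtySeconds, 5_000L));   // false
    }
}
```

Without the `maxCrawlDelayMs >= 0` guard, any positive Crawl-Delay compares
greater than -1000, which would match the symptom reported in [0].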
[0]:
http://lucene.472066.n3.nabble.com/Re-Re-Re-Re-fetcher-max-crawl-delay-1-doesn-t-work-tc3749639.html
> Add site fetcher.max.crawl.delay as log output by default.
> ----------------------------------------------------------
>
> Key: NUTCH-1284
> URL: https://issues.apache.org/jira/browse/NUTCH-1284
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Priority: Trivial
> Fix For: 1.7
>
> Attachments: NUTCH-1284.patch
>
>
> Currently, when manually scanning our log output we cannot infer which pages
> are governed by a crawl delay between successive fetch attempts of any given
> page within the site. The value should be made available as something like:
> {code}
> 2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching
> http://nutch.apache.org/ (crawl.delay=XXXms)
> {code}
> This way we can easily and quickly determine whether the fetcher is having to
> use this functionality or not.
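As a rough sketch of the message format requested in the description above,
the suffix could be built as follows. The class and method names
(FetchLogLine, format) are hypothetical and do not come from the attached
patch; only the "(crawl.delay=XXXms)" shape is taken from the issue text.

```java
// Hypothetical helper; it only reproduces the message shape quoted in the issue.
final class FetchLogLine {

    /** Builds "fetching <url> (crawl.delay=<n>ms)" for a delay in milliseconds. */
    static String format(String url, long crawlDelayMs) {
        return "fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)";
    }

    public static void main(String[] args) {
        // Prints: fetching http://nutch.apache.org/ (crawl.delay=5000ms)
        System.out.println(format("http://nutch.apache.org/", 5000L));
    }
}
```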