max. redirects not handled correctly: fetcher stops at max-1 redirects
----------------------------------------------------------------------
Key: NUTCH-962
URL: https://issues.apache.org/jira/browse/NUTCH-962
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.2, 1.3, 2.0
Reporter: Sebastian Nagel
The fetcher stops following redirects one redirect before the limit set by
http.redirect.max is reached.
The description of http.redirect.max:
> The maximum number of redirects the fetcher will follow when
> trying to fetch a page. If set to negative or 0, fetcher won't immediately
> follow redirected URLs, instead it will record them for later fetching.
suggests that, if it is set to 1, one redirect will be followed.
I tried to crawl two documents, the first redirecting to the second via
<meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
with http.redirect.max = 1.
The second document is not fetched, and its URL has state GONE in the CrawlDb:
fetching file:/test/redirects/meta_refresh.html
redirectCount=0
-finishing thread FetcherThread, activeThreads=1
- content redirect to file:/test/redirects/to/meta_refresh_target.html (fetching now)
- redirect count exceeded file:/test/redirects/to/meta_refresh_target.html
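For anyone who wants to reproduce the setup, the two test documents could be generated roughly as follows. This is only a sketch: the class name RedirectFixture, the use of a temporary directory instead of /test/redirects, and the body of the target page are assumptions made for illustration. Only the file names and the meta refresh tag are taken from the report.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: writes the two documents from the report.
final class RedirectFixture {

    // Creates meta_refresh.html and to/meta_refresh_target.html under root
    // and returns the path of the redirecting document (the seed).
    static Path create(Path root) throws IOException {
        Path targetDir = Files.createDirectories(root.resolve("to"));
        Path source = root.resolve("meta_refresh.html");
        Path target = targetDir.resolve("meta_refresh_target.html");

        // The meta tag is copied from the report; the surrounding HTML is a placeholder.
        String redirectPage = "<html><head>\n"
            + "<meta http-equiv=\"refresh\" content=\"0; URL=./to/meta_refresh_target.html\">\n"
            + "</head><body></body></html>\n";
        // Placeholder content; the report does not say what the target contains.
        String targetPage = "<html><body>redirect target</body></html>\n";

        Files.write(source, redirectPage.getBytes(StandardCharsets.UTF_8));
        Files.write(target, targetPage.getBytes(StandardCharsets.UTF_8));
        return source;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("redirects");
        // Print the file: URL that would go into the seed list.
        System.out.println(create(root).toUri());
    }
}
```

The printed file: URL would then be used as the only seed, with http.redirect.max set to 1.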
The attached patch would fix this: with http.redirect.max set to 1, exactly one
redirect is followed.
Of course, this would mean there is no way to skip redirects altogether, since 0
(as well as negative values) means "treat redirects as ordinary links".
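The difference between the reported and the intended behaviour can be illustrated with a small simulation. This is a sketch of the off-by-one only, not the actual fetcher code or the attached patch: the class RedirectLimitSketch, the method followedRedirects and the inclusive flag are hypothetical names, and the assumption is that the bug comes down to comparing the redirect counter with >= where > is intended.

```java
// Hypothetical simulation, not Nutch code: counts how many redirects get followed.
final class RedirectLimitSketch {

    // chainLength: number of redirects before the final document is reached
    // maxRedirect: value of http.redirect.max
    // inclusive:   false models the reported behaviour (limit hit at max-1),
    //              true models the intended one (up to max redirects followed)
    static int followedRedirects(int chainLength, int maxRedirect, boolean inclusive) {
        int redirectCount = 0; // redirects encountered so far
        int followed = 0;      // redirects actually followed
        boolean redirecting;
        do {
            // With max <= 0 redirects are never followed immediately,
            // they are only recorded for later fetching.
            redirecting = maxRedirect > 0 && redirectCount < chainLength;
            if (redirecting) {
                redirectCount++;
                boolean exceeded = inclusive
                    ? redirectCount > maxRedirect   // intended check
                    : redirectCount >= maxRedirect; // assumed off-by-one
                if (exceeded) {
                    break; // corresponds to "redirect count exceeded" in the log
                }
                followed++;
            }
        } while (redirecting);
        return followed;
    }

    public static void main(String[] args) {
        // One redirect in the chain, http.redirect.max = 1, as in the report.
        System.out.println(followedRedirects(1, 1, false)); // prints 0
        System.out.println(followedRedirects(1, 1, true));  // prints 1
    }
}
```

In both variants a value of 0 or less follows no redirect immediately, which matches the remark that redirects cannot be skipped entirely.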