max. redirects not handled correctly: fetcher stops at max-1 redirects
----------------------------------------------------------------------

                 Key: NUTCH-962
                 URL: https://issues.apache.org/jira/browse/NUTCH-962
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.2, 1.3, 2.0
            Reporter: Sebastian Nagel


The fetcher stops following redirects one redirect before the configured
maximum (http.redirect.max) is reached.

The description of http.redirect.max:
> The maximum number of redirects the fetcher will follow when
> trying to fetch a page. If set to negative or 0, fetcher won't immediately
> follow redirected URLs, instead it will record them for later fetching.
suggests that one redirect will be followed if the property is set to 1.

I tried to crawl two documents, the first redirecting to the second via
 <meta http-equiv="refresh" content="0; URL=./to/meta_refresh_target.html">
with http.redirect.max = 1.
The second document is not fetched, and its URL has state GONE in CrawlDb.

fetching file:/test/redirects/meta_refresh.html
redirectCount=0
-finishing thread FetcherThread, activeThreads=1
 - content redirect to file:/test/redirects/to/meta_refresh_target.html (fetching now)
 - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html

The attached patch would fix this: with http.redirect.max set to 1, one
redirect is followed.
Of course, this would mean there is no way to skip redirects entirely, since 0
(as well as negative values) means "treat redirects as ordinary links".
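
The off-by-one described above can be illustrated with a small, self-contained
simulation. This is only a sketch and not the actual Nutch fetcher code: the
class name RedirectLimitSketch, the method followed() and the inclusiveLimit
flag are hypothetical names chosen for illustration, and the two comparisons
are an assumption about how such a bug typically arises. The special handling
of 0 and negative values (recording redirects for later fetching) is not
modelled here.

```java
// Hypothetical sketch, not taken from the Nutch sources.
class RedirectLimitSketch {

    /**
     * Simulates a chain of chainLength redirects and returns how many of
     * them are followed under the given limit.
     *
     * inclusiveLimit = false: a redirect is followed only while
     *   redirectCount < maxRedirects (assumed shape of the reported bug).
     * inclusiveLimit = true: a redirect is followed while
     *   redirectCount <= maxRedirects (assumed shape of the fix).
     */
    static int followed(int chainLength, int maxRedirects, boolean inclusiveLimit) {
        int followedRedirects = 0;
        int redirectCount = 0;
        int pending = chainLength; // redirects still ahead in the chain

        while (pending > 0) {
            redirectCount++; // a redirect was encountered
            boolean withinLimit = inclusiveLimit
                    ? redirectCount <= maxRedirects
                    : redirectCount < maxRedirects;
            if (!withinLimit) {
                break; // corresponds to "redirect count exceeded" in the log
            }
            followedRedirects++;
            pending--;
        }
        return followedRedirects;
    }

    public static void main(String[] args) {
        // One redirect in the chain, limit of 1, as in the report.
        System.out.println("exclusive check: " + followed(1, 1, false));
        System.out.println("inclusive check: " + followed(1, 1, true));
    }
}
```

With a single redirect and a limit of 1, the exclusive comparison follows no
redirect at all, which matches the behaviour in the log above, while the
inclusive comparison follows exactly one.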



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
