[ 
https://issues.apache.org/jira/browse/NUTCH-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated NUTCH-2550:
-------------------------------
    Description: 
As I detailed in this GitHub 
[comment|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#r28470348],
 it appears that PR #221 broke redirects. Instead of following a redirect, the 
fetcher repeatedly fetches the *original URL* until {{http.redirect.max}} is 
exceeded, and the fetch then ends with {{STATUS_FETCH_GONE}}.

I noticed this issue when I was trying to crawl a site that responds with a 
301 Moved Permanently status code.

This should be pretty easy to fix, though: I was able to get redirects working 
again simply by inserting the assignment {{url = fit.url}} 
[here|https://github.com/apache/nutch/blob/8682b96c3b84018f187eabaadc096ceded34f250/src/java/org/apache/nutch/fetcher/FetcherThread.java#L388]
 and 
[here|https://github.com/apache/nutch/blob/8682b96c3b84018f187eabaadc096ceded34f250/src/java/org/apache/nutch/fetcher/FetcherThread.java#L409].
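
To make the failure mode concrete, here is a toy model of the loop. It is not the actual {{FetcherThread}} code: the {{RedirectSketch}} class, the {{fetch}} method, the {{FetchItem}} stand-in and the in-memory redirect map are all hypothetical. Only the names {{url}} and {{fit}} and the assignment {{url = fit.url}} come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the redirect loop; NOT the real Nutch FetcherThread.
final class RedirectSketch {

    // Hypothetical stand-in for a queued fetch item that carries its URL.
    static final class FetchItem {
        final String url;
        FetchItem(String url) { this.url = url; }
    }

    // redirects: maps a URL to its redirect target; unmapped URLs "respond 200".
    // applyFix:  when true, re-reads url from the new fetch item after a redirect.
    static String fetch(Map<String, String> redirects, String start,
                        int maxRedirects, boolean applyFix) {
        FetchItem fit = new FetchItem(start);
        String url = fit.url;
        int redirectCount = 0;
        while (true) {
            String target = redirects.get(url); // "fetch" whatever url holds
            if (target == null) {
                return "STATUS_FETCH_SUCCESS " + url;
            }
            redirectCount++;
            if (redirectCount > maxRedirects) {
                // Assumed behaviour: give up once the redirect limit is exceeded.
                return "STATUS_FETCH_GONE " + url;
            }
            fit = new FetchItem(target); // a new item is created for the target
            if (applyFix) {
                url = fit.url;           // the assignment proposed in this issue
            }
            // Without the assignment, url still holds the original URL,
            // so the next iteration fetches the same page again.
        }
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://example.com/old", "http://example.com/new");
        System.out.println(fetch(redirects, "http://example.com/old", 3, false));
        System.out.println(fetch(redirects, "http://example.com/old", 3, true));
    }
}
```

In this sketch the run without the assignment keeps requesting the original URL and ends with {{STATUS_FETCH_GONE}}, while the run with it follows the redirect and succeeds on the target URL.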

  was:As I detailed in this github 
[comment|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#r28470348],
 it appears that PR #221 broke redirects. The fetcher will repeatedly fetch the 
*original url* rather than the one it's supposed to be redirecting to until 
{{http.redirect.max}} is exceeded, and then end with {{STATUS_FETCH_GONE}}.


> Redirects are broken
> --------------------
>
>                 Key: NUTCH-2550
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2550
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.15
>            Reporter: Hans Brende
>            Priority: Blocker
>             Fix For: 1.15
>
>
> As I detailed in this GitHub 
> [comment|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#r28470348],
>  it appears that PR #221 broke redirects. Instead of following a redirect, 
> the fetcher repeatedly fetches the *original URL* until 
> {{http.redirect.max}} is exceeded, and the fetch then ends with 
> {{STATUS_FETCH_GONE}}.
> I noticed this issue when I was trying to crawl a site that responds with a 
> 301 Moved Permanently status code.
> This should be pretty easy to fix, though: I was able to get redirects 
> working again simply by inserting the assignment {{url = fit.url}} 
> [here|https://github.com/apache/nutch/blob/8682b96c3b84018f187eabaadc096ceded34f250/src/java/org/apache/nutch/fetcher/FetcherThread.java#L388]
>  and 
> [here|https://github.com/apache/nutch/blob/8682b96c3b84018f187eabaadc096ceded34f250/src/java/org/apache/nutch/fetcher/FetcherThread.java#L409].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
