[ 
http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12430117 ] 
            
Chris Schneider commented on NUTCH-273:
---------------------------------------

Another reason why it would be better to wait until the next segment to process 
the target of the redirect is that this target may already have been fetched. 
In this case, there's no need to refetch it. More importantly, though, 
refetching the page will cause its OPIC score to be distributed a second time 
to its outlinks. In fact, each page that redirects to the target page will 
cause the target page's OPIC score to get redistributed.
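
To make the double-counting concrete, here is a toy sketch in Python. It is not Nutch's OPIC code: the function `crawl`, its parameters, and the "split the score evenly across outlinks on every fetch" rule are all simplifying assumptions made for illustration only. It compares following each redirect immediately with deferring the target to the next segment and skipping it if it was already fetched.

```python
from collections import defaultdict


def crawl(redirecting_pages, target, outlinks, score, follow_immediately):
    """Toy model (hypothetical, not Nutch code).

    Returns the total score each outlink of `target` receives.
    """
    received = defaultdict(float)
    fetched = set()

    def fetch_target():
        # Assumption: every fetch of the target distributes its full
        # score evenly across its outlinks.
        share = score / len(outlinks)
        for link in outlinks:
            received[link] += share
        fetched.add(target)

    deferred = []
    for _page in redirecting_pages:
        if follow_immediately:
            # Immediate redirect: the target is refetched once per
            # redirecting page, so its score is distributed each time.
            fetch_target()
        else:
            # Deferred: only remember the target for the next segment.
            deferred.append(target)

    # "Next segment": fetch each deferred target at most once.
    for url in deferred:
        if url not in fetched:
            fetch_target()

    return dict(received)


immediate = crawl(["a", "b", "c"], "t", ["x", "y"], 1.0, follow_immediately=True)
deferred = crawl(["a", "b", "c"], "t", ["x", "y"], 1.0, follow_immediately=False)
print(immediate)  # each outlink receives the share three times
print(deferred)   # each outlink receives the share once
```

With three pages redirecting to the same target, the immediate variant hands each outlink three shares of the target's score, while the deferred variant hands out one.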

I honestly can't see a good reason for doing an immediate redirect, since 
hopefully these cases aren't common enough to make a significant difference to 
crawling performance.

Note that several other issues are related to this one, so any fix should take 
care to satisfy the goals of all of them. In particular, I agree that we should 
be saving more information about the redirection in the metadata (as well as 
about other protocol cases).

> When a page is redirected, the original url is NOT updated.
> -----------------------------------------------------------
>
>                 Key: NUTCH-273
>                 URL: http://issues.apache.org/jira/browse/NUTCH-273
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Lukas Vlcek
>
> [Excerpt from maillist, sender: Andrzej Bialecki]
> When a page is redirected, the original url is NOT updated - so, CrawlDB will 
> never know that a redirect occurred, it won't even know that a fetch 
> occurred... This looks like a bug.
> In 0.7 this was recorded in the segment, and then it would affect the Page 
> status during updatedb. It should do so in 0.8, too...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
