[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12437131 ]

Andrzej Bialecki  commented on NUTCH-353:
-----------------------------------------

I think this issue requires more discussion, especially how it affects the 
linkdb.

Let's say that page A links to B, but B redirects to C. Issues to discuss:

* Should we mark B as gone? We could do so to prevent refetching. We should 
also store the redirect URL in CrawlDatum.metaData. This redirect URL may 
change in the future, but since no page is ever truly gone (we should retry 
it at some point in the future), we should be able to update the redirect 
info.

* For all practical purposes, C now becomes a replacement for B. Should we 
transfer all inlink information (anchor text, incoming URLs, and score 
contributions) to C? From the implementation point of view, this would require 
changes to the linkdb format so that we can create "aliases" that automatically 
transfer all inlink information to C even though it is inserted under B.
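
To make the two questions above concrete, here is a minimal in-memory sketch in 
Java. It is not Nutch code: the class RedirectSketch, the Entry type, the 
"redirect.to" metadata key, and all method names are hypothetical stand-ins for 
the crawldb and linkdb, chosen only for illustration. It shows one possible 
reading of the proposal: B is marked gone with its redirect target stored in 
metadata (and overwritten if the target changes later), and inlinks recorded 
under B are resolved to C at read time, which is roughly what an "alias" would do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model; none of these names are Nutch APIs.
class RedirectSketch {
    enum Status { UNFETCHED, FETCHED, GONE }

    // Stand-in for a crawldb entry: a status plus free-form metadata.
    static final class Entry {
        Status status = Status.UNFETCHED;
        final Map<String, String> metaData = new HashMap<>();
    }

    static final String REDIRECT_KEY = "redirect.to"; // assumed key name

    final Map<String, Entry> crawlDb = new HashMap<>();
    // Stand-in for the linkdb: target URL -> URLs of pages linking to it.
    final Map<String, List<String>> inlinks = new HashMap<>();

    Entry entry(String url) {
        return crawlDb.computeIfAbsent(url, k -> new Entry());
    }

    void addLink(String from, String to) {
        inlinks.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    // Mark `from` as gone and remember where it redirected to.
    // Calling this again after a later retry overwrites the stored target.
    void recordRedirect(String from, String to) {
        Entry e = entry(from);
        e.status = Status.GONE;
        e.metaData.put(REDIRECT_KEY, to);
        entry(to); // make sure the target is known to the crawldb
    }

    // Follow stored redirects until a URL with no redirect is reached.
    // The `seen` set stops the walk if the redirects form a loop.
    String resolve(String url) {
        Set<String> seen = new HashSet<>();
        String current = url;
        while (seen.add(current)) {
            Entry e = crawlDb.get(current);
            String next = (e == null) ? null : e.metaData.get(REDIRECT_KEY);
            if (next == null) {
                return current;
            }
            current = next;
        }
        return current; // loop detected; return the URL where we stopped
    }

    // Inlinks recorded under every URL that resolves to `url`
    // (including `url` itself when it is not redirected).
    List<String> effectiveInlinks(String url) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : inlinks.entrySet()) {
            if (resolve(e.getKey()).equals(url)) {
                result.addAll(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RedirectSketch db = new RedirectSketch();
        db.addLink("A", "B");        // page A links to B
        db.recordRedirect("B", "C"); // B redirects to C
        System.out.println(db.entry("B").status);
        System.out.println(db.effectiveInlinks("C"));
    }
}
```

Resolving aliases at read time keeps the inlink data stored under B untouched, 
so a later change of the redirect target only needs a metadata update. The 
alternative, rewriting the stored inlinks under C when the redirect is first 
seen, would be harder to undo if B later redirects somewhere else.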

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
>                 Key: NUTCH-353
>                 URL: http://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0, 0.8.1
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.8.1
>
>         Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a server-side forward are not written back into the crawlDb 
> with a status change. Also, the nextFetchTime is not changed. 
> This causes the same page to be refetched again and again. As a result, Nutch 
> is not polite: it refetches the forwarding page and the target page in each 
> segment iteration. It also affects the scoring, since the forwarding page 
> contributes its score to all outlinks.


        
