[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12437131 ]

Andrzej Bialecki commented on NUTCH-353:
-----------------------------------------

I think this issue requires more discussion, especially about how it affects the linkdb. Let's say that page A links to B, but B redirects to C. Issues to discuss:

* Should we mark B as gone? We could do so to prevent refetching. We should also store the redirect URL in CrawlDatum.metaData. This redirect URL may change to some other value in the future, but since no page is ever truly gone (we should retry it at some point in the future), we should be able to adjust the redirect info.

* For all practical purposes, C now becomes a replacement for B. Should we transfer all inlink information (anchor text, incoming URLs, and score contributions) to C? From the implementation point of view this would require changes to the linkdb format, so that we can create "aliases" that automatically transfer all inlink information to C even though it is inserted under B.

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
>                 Key: NUTCH-353
>                 URL: http://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0, 0.8.1
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.8.1
>
>         Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a server-side forward are not written back into the crawlDb
> with a status change. Also, the nextFetchTime is not changed.
> This causes the same page to be refetched again and again. As a result,
> Nutch is not polite: it refetches the forwarding page and the target page in
> each segment iteration. It also affects scoring, since the forwarding page
> contributes its score to all outlinks.
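
To make the "alias" idea from the comment above more concrete, here is a minimal sketch in plain Java. It is not Nutch code and does not use the real linkdb or CrawlDatum classes: AliasLinkDb, Inlink, setRedirect, resolve and effectiveInlinks are all hypothetical names, and the in-memory maps only stand in for whatever the linkdb format would actually store. The assumption is that inlinks stay recorded under the URL they were inserted under (B), and a separate, updatable redirect table decides which page (C) they count for when they are read.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of a linkdb with redirect aliases.
// None of these names come from Nutch; they only illustrate the idea.
class AliasLinkDb {

    // One inlink: the page it comes from and the anchor text used there.
    static final class Inlink {
        final String fromUrl;
        final String anchor;

        Inlink(String fromUrl, String anchor) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
        }
    }

    // Inlinks keyed by the URL they were originally inserted under.
    private final Map<String, List<Inlink>> inlinks = new LinkedHashMap<>();
    // Redirect source -> current redirect target (the "alias" table).
    private final Map<String, String> redirects = new LinkedHashMap<>();

    void addInlink(String toUrl, String fromUrl, String anchor) {
        inlinks.computeIfAbsent(toUrl, k -> new ArrayList<>())
               .add(new Inlink(fromUrl, anchor));
    }

    // Records or overwrites a redirect, since the target may change
    // when the redirecting page is retried later.
    void setRedirect(String fromUrl, String toUrl) {
        redirects.put(fromUrl, toUrl);
    }

    // Follows redirects until a URL with no redirect is reached.
    // Stops when a URL repeats, so a redirect loop cannot hang the lookup.
    String resolve(String url) {
        Set<String> seen = new HashSet<>();
        String current = url;
        while (redirects.containsKey(current) && seen.add(current)) {
            current = redirects.get(current);
        }
        return current;
    }

    // Collects the inlinks of every stored URL that resolves to the same
    // final target as the given URL. The stored data is never rewritten.
    List<Inlink> effectiveInlinks(String url) {
        String target = resolve(url);
        List<Inlink> result = new ArrayList<>();
        for (Map.Entry<String, List<Inlink>> entry : inlinks.entrySet()) {
            if (resolve(entry.getKey()).equals(target)) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AliasLinkDb db = new AliasLinkDb();
        // Page A links to B, but B redirects to C.
        db.addInlink("http://example.com/B", "http://example.com/A", "link to B");
        db.setRedirect("http://example.com/B", "http://example.com/C");
        for (Inlink in : db.effectiveInlinks("http://example.com/C")) {
            System.out.println(in.fromUrl + " [" + in.anchor + "]");
        }
    }
}
```

Resolving aliases at read time rather than copying inlinks from B to C keeps the redirect adjustable: if B later redirects somewhere else, a single setRedirect call moves its inlinks to the new target. Score contributions are left out of this sketch; they would presumably need the same treatment.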
