[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422996 ]

Enrico Triolo commented on NUTCH-322:
-------------------------------------
OK, I can see your point. Nevertheless, I think we should consider some potential problems that could arise from such modifications:

1. When a redirect occurs, both the redirecting and the redirected pages should be indexed, independently of crawling depth, but I think this is what you meant from the beginning...

2. How should linkdb be updated? Or better, should linkdb be updated at all? I mean, if page A has a link to page B, and page B redirects to C, should we set an incoming link to C from A?

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>
> Key: NUTCH-322
> URL: http://issues.apache.org/jira/browse/NUTCH-322
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8-dev
> Reporter: Andrzej Bialecki
> Fix For: 0.8-dev
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus
> contains important information, such as protocol-level response code,
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData,
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In
> addition, if ProtocolStatus contains a valid lastModified time, that
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages
> is silently discarded. When Fetcher translates from protocol-level status to
> crawldb-level status it should probably store such pages with the following
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code
> indicates a transient change, so we probably shouldn't mark the initial URL
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a
> permanent change, so the initial URL is no longer valid, i.e. it will always
> result in redirects.

--
This message is automatically generated by JIRA.
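For reference, the status translation proposed in the quoted issue description could be sketched roughly as below. This is only an illustration, not the Fetcher code: the enums `ProtocolCode` and `DbStatus` and the class `RedirectStatusSketch` are hypothetical stand-ins for the constants in ProtocolStatus and CrawlDatum, and the mapping of a successful fetch to `DB_FETCHED` is an assumption added to make the example complete.

```java
/**
 * Illustrative sketch only. The enums below are hypothetical stand-ins
 * for the constants in Nutch's ProtocolStatus and CrawlDatum classes.
 */
final class RedirectStatusSketch {

    // Hypothetical subset of protocol-level outcomes.
    enum ProtocolCode { SUCCESS, MOVED, TEMP_MOVED }

    // Hypothetical subset of crawldb-level states.
    enum DbStatus { DB_FETCHED, DB_RETRY, DB_GONE }

    private RedirectStatusSketch() {
        // Static helper only; not meant to be instantiated.
    }

    /** Translates a protocol-level code into a crawldb-level status. */
    static DbStatus translate(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                // Transient change: keep the initial URL and retry it later.
                return DbStatus.DB_RETRY;
            case MOVED:
                // Permanent change: the initial URL will always redirect.
                return DbStatus.DB_GONE;
            default:
                // Assumption: any other code here means a normal fetch.
                return DbStatus.DB_FETCHED;
        }
    }
}
```

With this mapping, `translate(ProtocolCode.TEMP_MOVED)` yields `DB_RETRY` and `translate(ProtocolCode.MOVED)` yields `DB_GONE`. It says nothing about the linkdb question raised in the comment, which remains open.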
