[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12423187 ]

Andrzej Bialecki commented on NUTCH-322:
-----------------------------------------
Good questions ... ;)

ad 1: Google shows only the final page, and you can access it through both the original (starting) URL and the final redirected URL. You can't view the intermediate pages. To be Google-compatible we should index only the final page, but put it under both URLs. This is relatively easy to implement in Fetcher and index-basic, by appropriately marking the starting and intermediate pages, skipping any non-final pages during indexing, and then adding the original URL to the final URL when indexing the final page. Also, I think that if the redirect refresh time is large (e.g. larger than 20 seconds) we should consider the pages to be separate, and treat them separately.

ad 2: Google shows only inlinks going to the final URL. However, the same inlinks can be obtained by using either the starting or the final URL. OTOH MSN has separate inlinks in each case. I'm not sure yet how we should implement this...

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>
>                 Key: NUTCH-322
>                 URL: http://issues.apache.org/jira/browse/NUTCH-322
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8-dev
>            Reporter: Andrzej Bialecki
>             Fix For: 0.8-dev
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus
> contains important information, such as protocol-level response code,
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData,
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In
> addition, if ProtocolStatus contains a valid lastModified time, that
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages
> is silently discarded. When Fetcher translates from protocol-level status to
> crawldb-level status it should probably store such pages with the following
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code
> indicates a transient change, so we probably shouldn't mark the initial URL
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a
> permanent change, so the initial URL is no longer valid, i.e. it will always
> result in redirects.
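
For illustration, here is a minimal Java sketch of the status translation proposed in the issue, together with the refresh-delay rule suggested in the comment. It does not use the real Nutch classes: RedirectSketch, ProtocolCode, DbStatus, translate() and treatAsSeparatePages() are hypothetical names made up for this example, and only the constant names (MOVED, TEMP_MOVED, STATUS_DB_RETRY, STATUS_DB_GONE) come from the issue text. The threshold is passed in as a parameter because the 20 seconds mentioned above is only an example value.

```java
import java.util.Optional;

// Hypothetical stand-ins for illustration only; NOT the real Nutch classes.
class RedirectSketch {

    // Protocol-level outcomes named in the issue, plus a catch-all.
    enum ProtocolCode { MOVED, TEMP_MOVED, OTHER }

    // Crawldb-level statuses named in the issue.
    enum DbStatus { STATUS_DB_RETRY, STATUS_DB_GONE }

    // Proposed translation for redirected pages.
    // Returns empty for codes the issue does not cover.
    static Optional<DbStatus> translate(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                // Transient change: keep the initial URL eligible for retry.
                return Optional.of(DbStatus.STATUS_DB_RETRY);
            case MOVED:
                // Permanent change: the initial URL will always redirect.
                return Optional.of(DbStatus.STATUS_DB_GONE);
            default:
                return Optional.empty();
        }
    }

    // Rule from the comment: a refresh delay strictly larger than the
    // threshold means the two pages are treated as separate pages.
    static boolean treatAsSeparatePages(int refreshSeconds, int thresholdSeconds) {
        return refreshSeconds > thresholdSeconds;
    }
}
```

With these assumptions, translate(ProtocolCode.TEMP_MOVED) yields STATUS_DB_RETRY and translate(ProtocolCode.MOVED) yields STATUS_DB_GONE, while treatAsSeparatePages(30, 20) is true and treatAsSeparatePages(5, 20) is false. Returning an empty Optional for other codes is a choice made for this sketch, since the issue only specifies the two redirect cases.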
