[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422508 ]

Andrzej Bialecki commented on NUTCH-322:
-----------------------------------------

I hope I don't come across as arguing ... just trying to explain the rationale for this.

Redirected pages often have content - take a look at e.g. http://dmoz.org/Arts (notice the missing trailing slash). I agree that most of the time this content is trivial, but we always read this content anyway.

In some cases, it's not the content but the metadata (HTTP headers) that is important - start a protocol analyzer and look at what happens when you try to visit http://www.svd.se/annonsera (again, no slash at the end) - the second redirect will also set a cookie, which may be important for further requests in this session. And Nutch stores metadata only when it stores the content ...

There is also the case of content-level redirection, caused by <meta http-equiv="refresh" ...>, where you most likely get a full page of content, and then after a while you get redirected to another page. This may happen immediately, but it may also happen after 120 seconds - so the intermediate content does matter in this case.

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>
>                 Key: NUTCH-322
>                 URL: http://issues.apache.org/jira/browse/NUTCH-322
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8-dev
>            Reporter: Andrzej Bialecki
>             Fix For: 0.8-dev
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus
> contains important information, such as protocol-level response code,
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData,
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In
> addition, if ProtocolStatus contains a valid lastModified time, that
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages
> is silently discarded. When Fetcher translates from protocol-level status to
> crawldb-level status it should probably store such pages with the following
> translation of status codes:
>  * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code
> indicates a transient change, so we probably shouldn't mark the initial URL
> as bad.
>  * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a
> permanent change, so the initial URL is no longer valid, i.e. it will always
> result in redirects.
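
For illustration only, the status translation proposed in the issue could be sketched roughly as follows. This is not code from the Nutch tree: the class RedirectStatusSketch and the enums ProtocolCode and DbStatus are hypothetical stand-ins for ProtocolStatus and the CrawlDatum status constants, and the SUCCESS value exists only to show a non-redirect code that the sketch leaves alone.

```java
import java.util.Optional;

// Hypothetical sketch of the proposed redirect translation; not Nutch code.
public class RedirectStatusSketch {

    // Stand-in for the protocol-level codes named in the issue
    // (ProtocolStatus.MOVED, ProtocolStatus.TEMP_MOVED). SUCCESS is an
    // assumption added to demonstrate a non-redirect code.
    public enum ProtocolCode { SUCCESS, MOVED, TEMP_MOVED }

    // Stand-in for the crawldb-level statuses named in the issue
    // (CrawlDatum.STATUS_DB_RETRY, CrawlDatum.STATUS_DB_GONE).
    public enum DbStatus { RETRY, GONE }

    // Returns the proposed crawldb-level status for a redirect code,
    // or an empty Optional when the code is not a redirect.
    public static Optional<DbStatus> translateRedirect(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                // Transient change: keep the initial URL eligible for retry.
                return Optional.of(DbStatus.RETRY);
            case MOVED:
                // Permanent change: the initial URL will always redirect.
                return Optional.of(DbStatus.GONE);
            default:
                // Not a redirect; this sketch does not decide its status.
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (ProtocolCode code : ProtocolCode.values()) {
            System.out.println(code + " -> " + translateRedirect(code));
        }
    }
}
```

Returning an empty Optional for non-redirect codes keeps the sketch limited to the two translations the issue actually proposes, rather than guessing at how other codes should map.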
