[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422409 ]

Enrico Triolo commented on NUTCH-322:
-------------------------------------

I'm probably missing something, but redirected pages don't have content; they only return a 30x status in the HTTP header. Why would you need to fetch those pages?
In my opinion it would be better to store only the URLs of the intermediate redirected pages in the ProtocolStatus of the 'final' page.
It's only my two cents ;-)

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>
>              Key: NUTCH-322
>              URL: http://issues.apache.org/jira/browse/NUTCH-322
>          Project: Nutch
>       Issue Type: Bug
>       Components: fetcher
> Affects Versions: 0.8-dev
>         Reporter: Andrzej Bialecki
>          Fix For: 0.8-dev
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus
> contains important information, such as protocol-level response code,
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData,
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In
> addition, if ProtocolStatus contains a valid lastModified time, that
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages
> is silently discarded. When Fetcher translates from protocol-level status to
> crawldb-level status it should probably store such pages with the following
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code
> indicates a transient change, so we probably shouldn't mark the initial URL
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a
> permanent change, so the initial URL is no longer valid, i.e. it will always
> result in redirects.
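
To make the proposal in the quoted issue concrete, here is a minimal, self-contained sketch of the status translation and the metadata handling it describes. It does not use the real Nutch classes: `ProtocolCode`, `DbStatus`, `Datum`, `translate`, `output`, and the `"protocolStatus"` metadata key are hypothetical stand-ins, and treating a lastModified value greater than zero as "valid" is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectStatusSketch {

    // Hypothetical stand-ins for the ProtocolStatus and CrawlDatum codes
    // named in the issue; these are not the real Nutch types.
    enum ProtocolCode { MOVED, TEMP_MOVED }
    enum DbStatus { DB_RETRY, DB_GONE }

    // Hypothetical, simplified datum: a status, a modified time, and metadata.
    static final class Datum {
        DbStatus status;
        long modifiedTime;
        final Map<String, String> metaData = new HashMap<>();
    }

    // Translation proposed in the issue:
    //   TEMP_MOVED -> DB_RETRY (transient change, initial URL is not marked bad)
    //   MOVED      -> DB_GONE  (permanent change, initial URL is no longer valid)
    static DbStatus translate(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                return DbStatus.DB_RETRY;
            case MOVED:
                return DbStatus.DB_GONE;
            default:
                throw new IllegalArgumentException("unhandled code: " + code);
        }
    }

    // Builds the datum for a redirected page: keeps the protocol-level status
    // in the metadata and copies lastModified when it looks valid.
    // Assumption: "valid" means strictly positive.
    static Datum output(ProtocolCode code, long lastModified) {
        Datum datum = new Datum();
        datum.status = translate(code);
        datum.metaData.put("protocolStatus", code.name());
        if (lastModified > 0) {
            datum.modifiedTime = lastModified;
        }
        return datum;
    }

    public static void main(String[] args) {
        Datum temp = output(ProtocolCode.TEMP_MOVED, 0L);
        Datum perm = output(ProtocolCode.MOVED, 1153000000000L);
        System.out.println(temp.status + " " + temp.metaData);
        System.out.println(perm.status + " " + perm.modifiedTime);
    }
}
```

The alternative suggested in the comment, recording only the intermediate redirect URLs on the final page's status, would replace `output` with something that appends to a list on the final datum instead of creating one datum per redirect.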
