[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ]

Stefan Groschupf commented on NUTCH-322:
----------------------------------------
I think this is a serious problem. Page A issues a server-side redirect to Page B, and Page A is never written to the output. As a result, Page A's status and next fetch time never change, which means Page A is fetched again and again, indefinitely.
I suggest that we write out Page A with its status changed to STATUS_DB_GONE.

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>
>                 Key: NUTCH-322
>                 URL: http://issues.apache.org/jira/browse/NUTCH-322
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Andrzej Bialecki
>             Fix For: 0.9.0
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus
> contains important information, such as protocol-level response code,
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData,
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In
> addition, if ProtocolStatus contains a valid lastModified time, that
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages
> is silently discarded. When Fetcher translates from protocol-level status to
> crawldb-level status it should probably store such pages with the following
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code
> indicates a transient change, so we probably shouldn't mark the initial URL
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a
> permanent change, so the initial URL is no longer valid, i.e. it will always
> result in redirects.

-- 
This message is automatically generated by JIRA.
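For illustration only, the status translation proposed in the issue description could be sketched roughly as below. This is a standalone sketch, not Nutch code: the class RedirectStatusSketch, the enums ProtocolCode and DbStatus, the Datum class, the "protocolStatus" metadata key, and the rule that a lastModified value greater than zero counts as "valid" are all assumptions made up for this example. Only the two mappings (TEMP_MOVED to a retry status, MOVED to a gone status) and the idea of keeping the protocol status in the datum's metadata come from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

/** Standalone sketch. These types only mirror names used in the issue; they are NOT the Nutch classes. */
public class RedirectStatusSketch {

    /** Hypothetical stand-in for the protocol-level codes named in the issue. */
    public enum ProtocolCode { SUCCESS, MOVED, TEMP_MOVED }

    /** Hypothetical stand-in for the crawldb-level statuses named in the issue. */
    public enum DbStatus { DB_FETCHED, DB_RETRY, DB_GONE }

    /** Hypothetical, simplified datum: a status, a modified time, and string metadata. */
    public static class Datum {
        public DbStatus status;
        public long modifiedTime; // stays 0 when no valid lastModified was reported
        public final Map<String, String> metaData = new HashMap<>();
    }

    /** Translate a protocol-level code into a crawldb-level status, as proposed in the issue. */
    public static DbStatus toDbStatus(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                return DbStatus.DB_RETRY; // transient redirect: do not mark the initial URL as bad
            case MOVED:
                return DbStatus.DB_GONE;  // permanent redirect: the initial URL is no longer valid
            default:
                return DbStatus.DB_FETCHED; // assumption: every other code is treated as a normal fetch
        }
    }

    /** Build the datum that would be written out for the redirecting page itself. */
    public static Datum record(ProtocolCode code, long lastModified) {
        Datum datum = new Datum();
        datum.status = toDbStatus(code);
        // Keep the protocol-level status next to the crawldb status (key name is an assumption).
        datum.metaData.put("protocolStatus", code.name());
        // Assumption: a positive timestamp is what "valid lastModified" means.
        if (lastModified > 0) {
            datum.modifiedTime = lastModified;
        }
        return datum;
    }

    public static void main(String[] args) {
        // Page A permanently redirects to Page B; Page A still gets a record with a changed status.
        Datum pageA = record(ProtocolCode.MOVED, 0L);
        System.out.println(pageA.status + " " + pageA.metaData);
    }
}
```

The point of the sketch is that the redirecting page always produces a record with a changed status, so it would no longer be selected for fetching over and over.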
