[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
Andrzej Bialecki closed NUTCH-322. ----------------------------------- Resolution: Fixed Fixed in trunk/, rev. 490607 . NOTE: this doesn't solve the whole issue of proper handling of redirected pages from the point of view of scoring and LinkDB, but it does solve the original issue described here. > Fetcher discards ProtocolStatus, doesn't store redirected pages > --------------------------------------------------------------- > > Key: NUTCH-322 > URL: http://issues.apache.org/jira/browse/NUTCH-322 > Project: Nutch > Issue Type: Bug > Components: fetcher > Affects Versions: 0.8 > Reporter: Andrzej Bialecki > Assigned To: Andrzej Bialecki > Fix For: 0.9.0 > > > Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus > contains important information, such as protocol-level response code, > lastModified time, and possibly other messages. > I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, > which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In > addition, if ProtocolStatus contains a valid lastModified time, that > CrawlDatum's modified time should also be set to this value. > Additionally, Fetcher doesn't store redirected pages. Content of such pages > is silently discarded. When Fetcher translates from protocol-level status to > crawldb-level status it should probably store such pages with the following > translation of status codes: > * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code > indicates a transient change, so we probably shouldn't mark the initial URL > as bad. > * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a > permanent change, so the initial URL is no longer valid, i.e. it will always > result in redirects. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Nutch-developers mailing list Nutch-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-developers