Sebastian Nagel created NUTCH-1422:
--------------------------------------
Summary: reset signature for redirects
Key: NUTCH-1422
URL: https://issues.apache.org/jira/browse/NUTCH-1422
Project: Nutch
Issue Type: Bug
Components: crawldb, fetcher
Affects Versions: 1.4
Reporter: Sebastian Nagel
Fix For: 1.6
In a long-running continuous crawl with Nutch 1.4, URLs with an HTTP redirect
(http.redirect.max = 0) are kept as not-modified in the CrawlDb. Brief timeline
(cf. attached dumped segment / CrawlDb data):
2012-02-23 : injected
2012-02-24 : fetched
2012-03-30 : re-fetched, signature changed
2012-04-20 : re-fetched, redirected
2012-04-24 : in CrawlDb as db_notmodified, still indexed with old content!
The signature of a previously fetched document is not reset when the URL/doc
later changes to a redirect. CrawlDbReducer.reduce then sets the status to
db_notmodified because the new signature that comes with the fetch status is
identical to the old one.
Possible fixes (??):
* reset the signature in Fetcher
* handle this case in CrawlDbReducer.reduce
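
The following is a toy model of the problem and of both proposed fixes, not
Nutch code. It assumes that the fetch entry for a redirect starts out as a copy
of the CrawlDb entry and therefore still carries the old signature. The names
Datum, fetch_redirect, reduce, reset_signature and check_status, as well as all
status strings other than db_notmodified, are hypothetical and chosen for
illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative status labels; only "db_notmodified" is taken from the report.
DB_FETCHED = "db_fetched"
DB_NOTMODIFIED = "db_notmodified"
DB_REDIR = "db_redir"
FETCH_SUCCESS = "fetch_success"
FETCH_REDIR = "fetch_redir"


@dataclass
class Datum:
    """Hypothetical stand-in for a CrawlDb or fetch entry."""
    status: str
    signature: Optional[bytes] = None


def fetch_redirect(db_entry: Datum, reset_signature: bool) -> Datum:
    """Build the fetch entry for a URL that now answers with a redirect.

    Assumption: the entry is derived from the CrawlDb entry, so the old
    signature survives unless it is reset explicitly (first proposed fix).
    """
    signature = None if reset_signature else db_entry.signature
    return Datum(FETCH_REDIR, signature)


def reduce(old: Datum, fetched: Datum, check_status: bool = False) -> Datum:
    """Merge the old CrawlDb entry with the fetch entry.

    With check_status=True, signatures are compared only for successful
    fetches (second proposed fix).
    """
    same = old.signature is not None and fetched.signature == old.signature
    if same and (not check_status or fetched.status == FETCH_SUCCESS):
        return Datum(DB_NOTMODIFIED, old.signature)
    if fetched.status == FETCH_REDIR:
        return Datum(DB_REDIR, None)
    return Datum(DB_FETCHED, fetched.signature)


old = Datum(DB_FETCHED, b"old-signature")

# Reported behaviour: the stale signature matches, so the redirect is hidden.
buggy = reduce(old, fetch_redirect(old, reset_signature=False))

# Fix 1: reset the signature when the fetch entry is created.
fix_fetcher = reduce(old, fetch_redirect(old, reset_signature=True))

# Fix 2: leave the fetch entry alone, but ignore signatures for non-success.
fix_reducer = reduce(old, fetch_redirect(old, reset_signature=False),
                     check_status=True)

print(buggy.status, fix_fetcher.status, fix_reducer.status)
```

In this model either fix alone is enough to let the redirect status reach the
CrawlDb. Whether one, the other, or both are appropriate in Nutch depends on
code paths not shown here.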