Sebastian Nagel created NUTCH-1422:
--------------------------------------

             Summary: reset signature for redirects
                 Key: NUTCH-1422
                 URL: https://issues.apache.org/jira/browse/NUTCH-1422
             Project: Nutch
          Issue Type: Bug
          Components: crawldb, fetcher
    Affects Versions: 1.4
            Reporter: Sebastian Nagel
             Fix For: 1.6


In a long-running continuous crawl with Nutch 1.4, URLs that later return an HTTP 
redirect (http.redirect.max = 0) are kept as db_notmodified in the CrawlDb. Brief 
timeline (cf. attached dumped segment / CrawlDb data):
 2012-02-23 :  injected
 2012-02-24 :  fetched
 2012-03-30 :  re-fetched, signature changed
 2012-04-20 :  re-fetched, redirected
 2012-04-24 :  in CrawlDb as db_notmodified, still indexed with old content!

The signature of a previously fetched document is not reset when the URL/doc 
changes to a redirect at a later time. CrawlDbReducer.reduce then sets the 
status to db_notmodified because the signature that comes with the new fetch 
status is identical to the old one.

Possible fixes (??):
* reset the signature in Fetcher
* handle this case in CrawlDbReducer.reduce
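
To make the failure mode concrete, here is a minimal, self-contained sketch of the 
second option. It is a toy model, not Nutch code: the class SignatureResetSketch, 
its Entry type, and both enums are hypothetical stand-ins, and the rule that a fetch 
without a signature inherits the old one is an assumption about how the stale 
signature ends up being compared with itself.

```java
import java.util.Arrays;

// Toy model of the reduce step; all names are hypothetical, not the Nutch API.
class SignatureResetSketch {

    public enum FetchStatus { SUCCESS, REDIRECTED }

    public enum DbStatus { DB_FETCHED, DB_NOTMODIFIED, DB_REDIRECTED }

    /** Simplified stand-in for a CrawlDb entry: a status plus a content signature. */
    public static final class Entry {
        public final DbStatus status;
        public final byte[] signature; // null means "no signature stored"

        public Entry(DbStatus status, byte[] signature) {
            this.status = status;
            this.signature = signature;
        }
    }

    /**
     * Models the reported behaviour. ASSUMPTION: when the fetch carries no
     * signature (as for a redirect), the old signature is carried over, so the
     * comparison below sees two identical signatures.
     */
    public static Entry reduceWithoutReset(Entry old, FetchStatus fetch, byte[] fetchSignature) {
        byte[] sig = (fetchSignature != null) ? fetchSignature : old.signature;
        if (old.signature != null && Arrays.equals(old.signature, sig)) {
            // Stale signature matches itself: the entry looks unchanged.
            return new Entry(DbStatus.DB_NOTMODIFIED, sig);
        }
        DbStatus status = (fetch == FetchStatus.REDIRECTED)
                ? DbStatus.DB_REDIRECTED
                : DbStatus.DB_FETCHED;
        return new Entry(status, sig);
    }

    /** Sketch of the fix: a redirect drops the stored signature before any comparison. */
    public static Entry reduceWithReset(Entry old, FetchStatus fetch, byte[] fetchSignature) {
        if (fetch == FetchStatus.REDIRECTED) {
            return new Entry(DbStatus.DB_REDIRECTED, null);
        }
        return reduceWithoutReset(old, fetch, fetchSignature);
    }
}
```

For an entry with a stored signature and a REDIRECTED fetch that carries no 
signature, reduceWithoutReset returns DB_NOTMODIFIED (the symptom described above), 
while reduceWithReset returns DB_REDIRECTED with a null signature. Resetting in the 
Fetcher instead would move the same decision one step earlier.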


