URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and 
is generated over and over again
----------------------------------------------------------------------------------------------------------------

                 Key: NUTCH-1245
                 URL: https://issues.apache.org/jira/browse/NUTCH-1245
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.4, 1.5
            Reporter: Sebastian Nagel


A document gone with a 404 after db.fetch.interval.max (90 days) has passed
is fetched over and over again: although its fetch status is fetch_gone,
its status in CrawlDb stays db_unfetched. Consequently, this document will
be generated and fetched in every cycle from now on.
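
For context: db.fetch.interval.max is configured in seconds; assuming the stock
default of 90 days, the setting in conf/nutch-default.xml looks like this:
{noformat}
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value> <!-- 90 days in seconds -->
</property>
{noformat}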

To reproduce:
# create a CrawlDatum in CrawlDb whose retry interval hits db.fetch.interval.max
(I manipulated shouldFetch() in AbstractFetchSchedule to achieve this)
# now this URL is fetched again
# but when CrawlDb is updated with the fetch_gone, the CrawlDatum is reset to
db_unfetched and the retry interval is set to 0.9 * db.fetch.interval.max (81 days)
# this state does not change with further generate-fetch-update cycles; here the
dumps for two consecutive segments:
{noformat}
/tmp/testcrawl/segments/20120105161430
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:14:21 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:14:48 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
http://localhost/page_gone


/tmp/testcrawl/segments/20120105161631
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:16:23 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:20:05 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
http://localhost/page_gone
{noformat}
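
The per-segment dumps above were taken with the segment reader; assuming a
Nutch 1.4/1.5 installation, the invocation is something like:
{noformat}
% nutch readseg -get /tmp/testcrawl/segments/20120105161430 http://localhost/page_gone
{noformat}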

As far as I can see, it's caused by setPageGoneSchedule() in
AbstractFetchSchedule. In pseudo-code:
{code}
setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
    datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
    datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
    if (maxInterval < datum.fetchInterval) // necessarily true
       forceRefetch()

forceRefetch:
    if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
       datum.fetchInterval = 0.9 * maxInterval
    datum.status = db_unfetched // the fetch_gone status is lost here

shouldFetch (called from generate / Generator.map):
    if ((datum.fetchTime - curTime) > maxInterval)
       // always true if the crawler is launched in short intervals
       // (lower than 0.35 * maxInterval)
       datum.fetchTime = curTime // forces a refetch
{code}
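To see the feedback loop in isolation, here is a small, self-contained Java
sketch. It is a simulation of the pseudo-code above with made-up names and
simplified state, not the actual AbstractFetchSchedule code:
{code}
// Simulation of the generate-fetch-update loop sketched above.
// All names and constants are simplified stand-ins, not Nutch's real API.
public class GoneLoopSimulation {

  static final long DAY = 86400L;            // seconds per day
  static final long MAX_INTERVAL = 90 * DAY; // db.fetch.interval.max (90 days)

  // CrawlDatum state as it looks right after the first forceRefetch()
  static long fetchInterval = (long) (0.9 * MAX_INTERVAL); // 81 days
  static long fetchTime = 0;                               // absolute seconds
  static String status = "db_unfetched";

  // update: grow the interval, push fetchTime out, then force a refetch
  static void setPageGoneSchedule(long fetchedAt) {
    fetchInterval = (long) (1.5 * fetchInterval); // 1.5 * 0.9 = 1.35 * maxInterval
    fetchTime = fetchedAt + fetchInterval;        // see NUTCH-516
    if (MAX_INTERVAL < fetchInterval) {           // necessarily true
      forceRefetch();
    }
  }

  static void forceRefetch() {
    if (fetchInterval > MAX_INTERVAL) {
      fetchInterval = (long) (0.9 * MAX_INTERVAL); // back to 81 days
    }
    status = "db_unfetched";                       // fetch_gone is lost here
  }

  // generate: a fetchTime more than maxInterval ahead forces a refetch
  static boolean shouldFetch(long curTime) {
    if (fetchTime - curTime > MAX_INTERVAL) {
      fetchTime = curTime;
    }
    return fetchTime <= curTime;
  }

  public static void main(String[] args) {
    long now = 0;
    for (int cycle = 1; cycle <= 3; cycle++) {
      boolean generated = shouldFetch(now);  // generate
      setPageGoneSchedule(now);              // fetch got a 404 -> update
      System.out.printf("cycle %d: generated=%s status=%s interval=%dd fetchTime=now+%dd%n",
          cycle, generated, status, fetchInterval / DAY, (fetchTime - now) / DAY);
      now += DAY; // crawler relaunched daily, i.e. well below 0.35 * maxInterval
    }
  }
}
{code}
Every simulated cycle ends with status db_unfetched, an 81-day retry interval,
and a fetch time about 1.35 * maxInterval in the future, so shouldFetch()
generates the URL again on the very next run.
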
After setPageGoneSchedule is called via update, the state is db_unfetched and
the retry interval is 0.9 * db.fetch.interval.max (81 days).
Although the fetch time in the CrawlDb is far in the future,
{noformat}
% nutch readdb testcrawl/crawldb -url http://localhost/page_gone
URL: http://localhost/page_gone
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun May 06 05:20:05 CEST 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 1.0
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}
the URL is generated again, because (fetch time - current time) is larger than
db.fetch.interval.max.
The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times
db.fetch.interval.max, and the fetch time is always close to current time +
1.35 * db.fetch.interval.max.
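
In numbers, with db.fetch.interval.max = 90 days, every cycle repeats the same steps:
{noformat}
generate:  fetchTime - curTime ~ 121.5d > 90d  ->  fetchTime = curTime, URL is generated
fetch:     404                                 ->  fetch_gone
update:    setPageGoneSchedule: fetchInterval = 1.5 * 81d = 121.5d  (= 1.35 * 90d)
                                fetchTime     = fetch time + 121.5d
           forceRefetch:        fetchInterval = 0.9 * 90d = 81d, status = db_unfetched
{noformat}
which is exactly the CrawlDb entry shown by readdb above.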

It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
