[
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493098#comment-13493098
]
Markus Jelsma commented on NUTCH-1245:
--------------------------------------
Thanks for the thorough unit tests, they clearly show there's a problem to be
solved. I think I agree with the proposed fix you mention for 1245, it makes
sense. Not calling forceRefetch (it only leads to more transient errors) but
instead setting the fetch time ahead by the max interval, so the page is seen
again later, sounds like what one would expect.
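For illustration only, here is a rough sketch of that idea (not a patch: the
class name is made up, and it assumes the setPageGoneSchedule signature and the
protected maxInterval field of AbstractFetchSchedule in 1.x, so the attached
patches remain authoritative):
{code}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

// Sketch, not the patch: cap the interval at db.fetch.interval.max and push
// the fetch time out, instead of calling forceRefetch() and flipping the
// datum back to db_unfetched.
public class CappedGoneFetchSchedule extends DefaultFetchSchedule {
  @Override
  public CrawlDatum setPageGoneSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime) {
    // increase the interval by 50% as before ...
    float newInterval = datum.getFetchInterval() * 1.5f;
    // ... but never beyond maxInterval, and without forceRefetch()
    if (newInterval > maxInterval) {
      newInterval = maxInterval;
    }
    datum.setFetchInterval(newInterval);
    // look at the page again only after the (capped) interval has passed
    datum.setFetchTime(fetchTime + (long) newInterval * 1000L);
    return datum;
  }
}
{code}
That way the datum never carries an interval above maxInterval, so shouldFetch
has no reason to pull the fetch time back to the present.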
On 578 and 1247, I think that if we solve 578, overflowing may not be a big
problem anymore. With Nutch as it works today it takes at least 128 days for it
to overflow; if we fix it and people use a more reasonable max interval (say 30
days or higher), it'll overflow 10 years from now, which I think is reasonable.
I'm not yet sure about the fix on 578. It's complex indeed ;)
> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb
> and is generated over and over again
> ----------------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-1245
> URL: https://issues.apache.org/jira/browse/NUTCH-1245
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4, 1.5
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.6
>
> Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch,
> NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch
>
>
> A document that is gone (404) after db.fetch.interval.max (90 days) has passed
> is fetched over and over again: although the fetch status is fetch_gone,
> its status in CrawlDb remains db_unfetched. Consequently, this document will
> be generated and fetched in every cycle from now on.
> To reproduce:
> # create a CrawlDatum in CrawlDb whose retry interval hits
> db.fetch.interval.max (I manipulated shouldFetch() in
> AbstractFetchSchedule to achieve this)
> # now this URL is fetched again
> # but when the CrawlDb is updated with the fetch_gone, the CrawlDatum is reset
> to db_unfetched and the retry interval is set to 0.9 * db.fetch.interval.max
> (81 days)
> # this does not change in subsequent generate-fetch-update cycles, here the
> dump for two segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in
> AbstractFetchSchedule. Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>   datum.fetchInterval = 1.5 * datum.fetchInterval   // now 1.5 * 0.9 * maxInterval
>   datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
>   if (maxInterval < datum.fetchInterval)            // necessarily true
>     forceRefetch()
>
> forceRefetch:
>   if (datum.fetchInterval > maxInterval)            // true because it's 1.35 * maxInterval
>     datum.fetchInterval = 0.9 * maxInterval
>   datum.status = db_unfetched
>
> shouldFetch (called from generate / Generator.map):
>   if ((datum.fetchTime - curTime) > maxInterval)
>     // always true if the crawler is launched in short intervals
>     // (lower than 0.35 * maxInterval)
>     datum.fetchTime = curTime                       // forces a refetch
> {code}
> After setPageGoneSchedule is called via update, the state is db_unfetched and
> the retry interval is 0.9 * db.fetch.interval.max (81 days).
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than
> db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times
> db.fetch.interval.max, and the fetch time always ends up close to
> current time + 1.35 * db.fetch.interval.max (see the small simulation below).
> It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
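> For illustration, a small standalone simulation of this loop (plain Java, no
> Nutch classes; the class name is made up, and the 90-day max interval and the
> 0.9 / 1.5 factors are taken from the pseudo-code above):
> {code}
> // Standalone illustration of the generate-fetch-update loop described above;
> // it mimics the pseudo-code only and does not touch the real Nutch classes.
> public class GonePageLoopDemo {
>   static final double DAY = 24 * 3600;               // seconds
>   static final double MAX_INTERVAL = 90 * DAY;       // db.fetch.interval.max
>
>   public static void main(String[] args) {
>     double interval = 0.9 * MAX_INTERVAL;            // 81 days, as in the dumps
>     double fetchTime = 0;
>     double curTime = 0;                              // seconds since "now"
>
>     for (int cycle = 1; cycle <= 3; cycle++) {
>       // update: setPageGoneSchedule
>       interval = 1.5 * interval;                     // momentarily 1.35 * max
>       fetchTime = curTime + interval;
>       if (interval > MAX_INTERVAL) {                 // forceRefetch
>         interval = 0.9 * MAX_INTERVAL;               // back to 81 days, db_unfetched
>       }
>       // generate: shouldFetch, with the crawler launched again one day later
>       curTime += DAY;
>       boolean generated = false;
>       if (fetchTime - curTime > MAX_INTERVAL) {
>         fetchTime = curTime;                         // forces a refetch
>         generated = true;
>       }
>       System.out.printf("cycle %d: interval = %.0f days, generated again = %b%n",
>           cycle, interval / DAY, generated);
>     }
>   }
> }
> {code}
> Every cycle it prints interval = 81 days and generated again = true, matching
> the segment dumps above.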
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira