[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812289#comment-13812289 ]

Hudson commented on NUTCH-1245:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #808 (See [https://builds.apache.org/job/Nutch-nutchgora/808/])
NUTCH-1588 Port NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again to 2.x (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1538200)
Files :
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java

URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
----------------------------------------------------------------------------------------------------------------
Key: NUTCH-1245
URL: https://issues.apache.org/jira/browse/NUTCH-1245
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Priority: Critical
Fix For: 1.7
Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch, NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch

A document gone with 404 after db.fetch.interval.max (90 days) has passed is fetched over and over again: although its fetch status is fetch_gone, its status in the CrawlDb stays db_unfetched. Consequently, this document is generated and fetched in every cycle from now on.
To reproduce:
# create a CrawlDatum in the CrawlDb whose retry interval hits db.fetch.interval.max (I manipulated shouldFetch() in AbstractFetchSchedule to achieve this)
# now this URL is fetched again
# but when the CrawlDb is updated with the fetch_gone, the CrawlDatum is reset to db_unfetched and the retry interval is fixed at 0.9 * db.fetch.interval.max (81 days)
# this does not change in subsequent generate-fetch-update cycles, shown here for two segments:
{noformat}
/tmp/testcrawl/segments/20120105161430
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
  Status: 1 (db_unfetched)
  Fetch time: Thu Jan 05 16:14:21 CET 2012
  Modified time: Thu Jan 01 01:00:00 CET 1970
  Retries since fetch: 0
  Retry interval: 6998400 seconds (81 days)
  Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0: http://localhost/page_gone
Crawl Fetch::
  Status: 37 (fetch_gone)
  Fetch time: Thu Jan 05 16:14:48 CET 2012
  Modified time: Thu Jan 01 01:00:00 CET 1970
  Retries since fetch: 0
  Retry interval: 6998400 seconds (81 days)
  Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0: http://localhost/page_gone

/tmp/testcrawl/segments/20120105161631
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
  Status: 1 (db_unfetched)
  Fetch time: Thu Jan 05 16:16:23 CET 2012
  Modified time: Thu Jan 01 01:00:00 CET 1970
  Retries since fetch: 0
  Retry interval: 6998400 seconds (81 days)
  Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0: http://localhost/page_gone
Crawl Fetch::
  Status: 37 (fetch_gone)
  Fetch time: Thu Jan 05 16:20:05 CET 2012
  Modified time: Thu Jan 01 01:00:00 CET 1970
  Retries since fetch: 0
  Retry interval: 6998400 seconds (81 days)
  Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}
As far as I can see it's caused by setPageGoneSchedule() in AbstractFetchSchedule.
Some pseudo-code:
{code}
setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
  datum.fetchInterval = 1.5 * datum.fetchInterval  // now 1.5 * 0.9 * maxInterval
  datum.fetchTime = fetchTime + datum.fetchInterval  // see NUTCH-516
  if (maxInterval < datum.fetchInterval)  // necessarily true
    forceRefetch()

forceRefetch:
  if (datum.fetchInterval > maxInterval)  // true because it's 1.35 * maxInterval
    datum.fetchInterval = 0.9 * maxInterval
  datum.status = db_unfetched

shouldFetch (called from generate / Generator.map):
  if ((datum.fetchTime - curTime) > maxInterval)
    // always true if the crawler is launched in short intervals
    // (lower than 0.35 * maxInterval)
    datum.fetchTime = curTime  // forces a refetch
{code}
After setPageGoneSchedule is called via update, the state is db_unfetched and the retry interval is 0.9 * db.fetch.interval.max (81 days). Although the fetch time in the CrawlDb is far in the future
{noformat}
% nutch readdb testcrawl/crawldb -url http://localhost/page_gone
URL: http://localhost/page_gone
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun May 06 05:20:05 CEST 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 1.0
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}
the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max. The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max. It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
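The endless-refetch loop in the pseudo-code above can be sketched as a minimal, self-contained simulation. This is hypothetical illustration code, not Nutch's actual AbstractFetchSchedule: the class name, field names, and the daily relaunch interval are assumptions made only to show why a gone page is generated in every cycle.

```java
// Hypothetical simulation of the schedule logic described in the pseudo-code.
public class GoneScheduleSim {
    // db.fetch.interval.max: 90 days, in seconds
    static final double MAX_INTERVAL = 90 * 24 * 3600.0;

    double fetchInterval = 0.9 * MAX_INTERVAL; // already at the 81-day ceiling
    double fetchTime = 0.0;
    String status = "db_unfetched";

    // called from update (CrawlDbReducer.reduce) when the fetch returned 404
    void setPageGoneSchedule(double now) {
        fetchInterval = 1.5 * fetchInterval;   // now 1.35 * MAX_INTERVAL
        fetchTime = now + fetchInterval;       // pushed far into the future
        if (MAX_INTERVAL < fetchInterval) {    // necessarily true
            forceRefetch();
        }
    }

    void forceRefetch() {
        if (fetchInterval > MAX_INTERVAL) {
            fetchInterval = 0.9 * MAX_INTERVAL; // back to 81 days
        }
        status = "db_unfetched";                // status is reset, not db_gone
    }

    // called from generate (Generator.map)
    boolean shouldFetch(double now) {
        if (fetchTime - now > MAX_INTERVAL) {
            // true whenever the crawler relaunches within 0.35 * MAX_INTERVAL
            fetchTime = now; // forces a refetch
            return true;
        }
        return fetchTime <= now;
    }

    public static void main(String[] args) {
        GoneScheduleSim datum = new GoneScheduleSim();
        for (int cycle = 0; cycle < 5; cycle++) {
            double now = cycle * 24 * 3600.0; // crawler launched daily
            boolean generated = datum.shouldFetch(now);
            System.out.println("cycle " + cycle + ": generated=" + generated
                    + " status=" + datum.status);
            if (generated) {
                datum.setPageGoneSchedule(now); // the URL is still a 404
            }
        }
        // prints "generated=true status=db_unfetched" for every cycle:
        // the datum never settles, matching the oscillation described above.
    }
}
```

Running it shows that each update cycle bumps the interval to 1.35 * max, forceRefetch() shrinks it back to 0.9 * max, and the next generate sees (fetchTime - now) > max and schedules the gone URL again.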
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13687867#comment-13687867 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

I think we should also include this one in the 1.7 RC; it is even more important than the issue we committed for the FreeGenerator. We have been using this in production for a long time now, and none of the reported issues remain a problem.
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688065#comment-13688065 ]

Sebastian Nagel commented on NUTCH-1245:
----------------------------------------

Definitely, [~markus17]. I'll commit this evening, but without the unit tests. Those will be added together with NUTCH-1502 early in 1.8 (I hope to get them ready and complete soon).
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688111#comment-13688111 ]

Lewis John McGibbney commented on NUTCH-1245:
---------------------------------------------

push it. get it in there. I'll cut the RC tonight my time. ;)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688145#comment-13688145 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

Splendid! Thanks guys!
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13688533#comment-13688533 ]

Hudson commented on NUTCH-1245:
-------------------------------

Integrated in Nutch-trunk #2248 (See [https://builds.apache.org/job/Nutch-trunk/2248/])
NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again (Revision 1494776)

Result = SUCCESS
snagel : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1494776
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546925#comment-13546925 ]

Markus Jelsma commented on NUTCH-1245:
--------------------------------------

Yes, and it also fixes the problem.
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13539943#comment-13539943 ] Markus Jelsma commented on NUTCH-1245: --

There's an issue with the patch after all!

{code}
URL:
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Dec 20 00:19:09 UTC 2012
Modified time: Wed May 16 12:48:30 UTC 2012
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.0
Signature: b1fa188be92a8dfa5db51e80c4af192a
Metadata: Content-Type: application/xhtml+xml_pst_: success(1), lastModified=0
{code}

{code}
URL: http://www.remeha.nl/intelligentenergy/index.php/remeha_evita_hre_ketel/subsidie/
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Dec 20 00:38:19 UTC 2012
Modified time: Wed May 16 12:48:30 UTC 2012
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.0
Signature: b1fa188be92a8dfa5db51e80c4af192a
Metadata: Content-Type: application/xhtml+xml_pst_: success(1), lastModified=0
{code}

The fetch time is not incremented at all.
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13537860#comment-13537860 ] Markus Jelsma commented on NUTCH-1245: --

Keep in mind, the DummyReporter needs to implement getProgress(), at least on Hadoop 1.1.1:

{code}
public float getProgress() {
  return 1f;
}
{code}
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13511429#comment-13511429 ] Markus Jelsma commented on NUTCH-1245: --

Sebastian, I'm seeing good results with this patch for now, with a low db.fetch.interval.max (60 days).
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13511539#comment-13511539 ] kiran commented on NUTCH-1245: --

Can the 2.x version be affected by the same issue? I am crawling a website and after some crawls it said db_unfetched is 960. Even though I crawled 10 times, the db_unfetched count remained the same. Any inputs?
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13525439#comment-13525439 ] Sebastian Nagel commented on NUTCH-1245: --

@kiran: yes, 2.x is affected since the fetch schedulers do not differ (much) between 1.x and 2.x. However, with default settings you need a couple of months of continuous crawling to run into this problem.

@Markus: good news! Pulled the test out to NUTCH-1502 (broader coverage, needs more time). Are there objections regarding the proposed patch?
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13525804#comment-13525804 ] Markus Jelsma commented on NUTCH-1245: --

No objections so far. I've put this in production on Monday, after I had it baked for two weeks or so in a test environment. I'm still keeping an eye on it, and only good results so far, but I'd like to see it running a bit longer in case there's some edge case lurking around - although I doubt it :)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493098#comment-13493098 ] Markus Jelsma commented on NUTCH-1245: --

Thanks for the thorough unit tests, they clearly show there's a problem to be solved. I think I agree with the proposed fix you mention for NUTCH-1245; it makes sense. Not calling forceRefetch (it only leads to more transient errors) but setting the fetch time to the max interval, to see the page again later, sounds like what one would expect.

On NUTCH-578 and NUTCH-1247: I think if we solve 578, overflowing may not be a big problem anymore. With Nutch as it works today it takes at least 128 days for it to overflow; if we fix it and people use a more reasonable max interval (say 30 days or higher), it'll overflow 10 years from now, which I think is reasonable. I'm not yet sure about the fix on 578. It's complex indeed ;)
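The fix direction discussed in this comment (drop the forceRefetch() call and push the fetch time out, so a gone page is re-checked only after the max interval) could look roughly like the sketch below. This is a self-contained illustration of the idea with simplified names, not the actual committed patch to AbstractFetchSchedule:

```java
// Sketch of the fix idea: when a page stays gone, cap the retry interval at
// db.fetch.interval.max and defer the next check, instead of calling
// forceRefetch(), which resets the status to db_unfetched.
// Illustrative only; the real patch may differ in detail.
public class PageGoneFix {
    static final long MAX_INTERVAL = 90L * 24 * 60 * 60; // db.fetch.interval.max in seconds

    long fetchInterval = (long) (0.9 * MAX_INTERVAL);
    long fetchTime = 0;
    String status = "db_gone";

    void setPageGoneSchedule(long now) {
        fetchInterval = (long) (1.5 * fetchInterval);    // back off as before
        if (fetchInterval > MAX_INTERVAL)
            fetchInterval = MAX_INTERVAL;                // cap, do not forceRefetch()
        fetchTime = now + fetchInterval;                 // "see again later"
        // status stays db_gone, so the URL is not regenerated every cycle
    }
}
```

Because (fetchTime - curTime) can now never exceed maxInterval, the clamp in shouldFetch() no longer fires, and the gone page is re-fetched at most once per db.fetch.interval.max instead of in every generate-fetch-update cycle.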
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13488864#comment-13488864 ] Markus Jelsma commented on NUTCH-1245: --

Sebastian, very interesting! Can you close either this issue or NUTCH-578? I hope to check your patches soon!
To reproduce: # create a CrawlDatum in CrawlDb which retry interval hits db.fetch.interval.max (I manipulated the shouldFetch() in AbstractFetchSchedule to achieve this) # now this URL is fetched again # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 days) # this does not change with every generate-fetch-update cycle, here for two segments: {noformat} /tmp/testcrawl/segments/20120105161430 SegmentReader: get 'http://localhost/page_gone' Crawl Generate:: Status: 1 (db_unfetched) Fetch time: Thu Jan 05 16:14:21 CET 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone Crawl Fetch:: Status: 37 (fetch_gone) Fetch time: Thu Jan 05 16:14:48 CET 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone /tmp/testcrawl/segments/20120105161631 SegmentReader: get 'http://localhost/page_gone' Crawl Generate:: Status: 1 (db_unfetched) Fetch time: Thu Jan 05 16:16:23 CET 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone Crawl Fetch:: Status: 37 (fetch_gone) Fetch time: Thu Jan 05 16:20:05 CET 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone {noformat} As far as I can see it's caused by setPageGoneSchedule() in AbstractFetchSchedule. 
Some pseudo-code: {code} setPageGoneSchedule (called from update / CrawlDbReducer.reduce): datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516 if (maxInterval datum.fetchInterval) // necessarily true forceRefetch() forceRefetch: if (datum.fetchInterval maxInterval) // true because it's 1.35 * maxInterval datum.fetchInterval = 0.9 * maxInterval datum.status = db_unfetched // shouldFetch (called from generate / Generator.map): if ((datum.fetchTime - curTime) maxInterval) // always true if the crawler is launched in short intervals // (lower than 0.35 * maxInterval) datum.fetchTime = curTime // forces a refetch {code} After setPageGoneSchedule is called via update the state is db_unfetched and the retry interval 0.9 * db.fetch.interval.max (81 days). Although the fetch time in the CrawlDb is far in the future {noformat} % nutch readdb testcrawl/crawldb -url http://localhost/page_gone URL: http://localhost/page_gone Version: 7 Status: 1 (db_unfetched) Fetch time: Sun May 06 05:20:05 CEST 2012 Modified time: Thu Jan 01 01:00:00 CET 1970 Retries since fetch: 0 Retry interval: 6998400 seconds (81 days) Score: 1.0 Signature: null Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone {noformat} the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max. The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max. It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578 -- This message
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488935#comment-13488935 ] Sebastian Nagel commented on NUTCH-1245: They are not duplicates, but the effects are similar:

NUTCH-1245
- Caused by calling forceRefetch just after a fetch has led to a fetch_gone: if the fetchInterval is close to db.fetch.interval.max, setPageGoneSchedule calls forceRefetch. That's useless since we got a 404 just now (or within the last day(s) for large crawls).
- Proposed fix: setPageGoneSchedule should not call forceRefetch but keep the fetchInterval within/below db.fetch.interval.max.

NUTCH-578
- Although the status of a page fetched 3 times (db.fetch.retry.max) with a transient error (fetch_retry) is set to db_gone, the fetchInterval is still only incremented by one day. So the next day this page is fetched again.
- Every fetch_retry still increments the retry counter, so that it may overflow (NUTCH-1247).
- Fix:
  * Call setPageGoneSchedule in CrawlDbReducer.reduce when the retry counter is hit and the status is set to db_gone. All patches (by various users/committers) agree on this: it will set the fetchInterval to a value larger than one day, so that from now on the URL is not fetched again and again.
  * Reset the retry counter to 0, or prohibit an overflow. I'm not sure what the best solution is; see the comments on NUTCH-578.

Markus, it would be great if you started with a look at the JUnit patch. It has two aims: catch the error and make analysis easier (it logs a lot). I would like to extend the test to other CrawlDatum state transitions: these are complex for continuous crawls in combination with retry counters, intervals, signatures, etc. An exhaustive test could ensure that we do not break other state transitions.
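The proposed fix for NUTCH-1245 can be sketched in a few lines. This is a hypothetical illustration of the direction described above, not the committed patch: keep the 1.5x backoff for gone pages but clamp it at db.fetch.interval.max instead of calling forceRefetch.

```java
// Hypothetical sketch of the fix direction, not the committed NUTCH-1245
// patch: back off gone pages, but clamp the interval at
// db.fetch.interval.max rather than triggering forceRefetch (which would
// reset the status to db_unfetched).
public class PageGoneBackoff {
    static final long MAX_INTERVAL = 90L * 24 * 3600; // db.fetch.interval.max, seconds

    static long nextGoneInterval(long fetchInterval) {
        long next = (long) (1.5 * fetchInterval); // keep the backoff...
        return Math.min(next, MAX_INTERVAL);      // ...but never exceed the max
    }
}
```

With such a clamp, a datum at 0.9 * maxInterval moves up to maxInterval and stays there as db_gone, so the generator no longer reselects it every cycle.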
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456928#comment-13456928 ] Markus Jelsma commented on NUTCH-1245: Any ideas on this issue?
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192215#comment-13192215 ] Sebastian Nagel commented on NUTCH-1245: There are several possibilities to get a CrawlDatum with a fetchInterval large enough that a multiplication by 1.5 will exceed maxInterval:
# with adaptive fetch scheduling:
## db.fetch.schedule.adaptive.min_interval (> 1.5 * db.fetch.interval.max) -- as José commented, some kind of misconfiguration
## after some time when the document didn't change and db.fetch.schedule.adaptive.max_interval (> 1.5 * db.fetch.interval.max) -- but this is the default (1 year > 1.5 * 90 days)!
# db.fetch.interval.default (> 1.5 * db.fetch.interval.max) -- again some kind of misconfiguration
# also, setPageGoneSchedule increases the fetchInterval every time it is called, so after a gone page has been re-fetched several times we run into the same situation
Anyway, I think the misconfigurations should also be made impossible.
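The overflow condition behind all of the cases above can be stated as a single predicate. A hypothetical helper (illustrative, not part of Nutch) that flags a fetchInterval which would enter the forceRefetch oscillation:

```java
// Hypothetical helper, not Nutch code: a fetchInterval triggers the
// oscillation described in this issue once multiplying it by 1.5
// exceeds db.fetch.interval.max.
public class IntervalCheck {
    static boolean triggersForceRefetch(long intervalSeconds, long maxIntervalSeconds) {
        return 1.5 * intervalSeconds > maxIntervalSeconds;
    }
}
```

For example, the adaptive max_interval default of one year against a 90-day db.fetch.interval.max trips this check, matching the default-configuration case listed above.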
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183292#comment-13183292 ] José Gil commented on NUTCH-1245: FWIW, we experienced a similar problem (entries in CrawlDb marked as db_fetched in spite of having resulted in a 404 response) and we eventually traced it to a configuration problem: our db.fetch.schedule.adaptive.min_interval was larger than our db.fetch.interval.max. After reducing the first and increasing the second, the entries are marked as db_gone as expected.
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180515#comment-13180515 ] Markus Jelsma commented on NUTCH-1245: Thanks! This must be the same issue as NUTCH-578, but it is marked as related for now. Can you provide a patch?
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa