[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
---------------------------------
    Fix Version/s:     (was: 1.8)
                       1.7

> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb
> and is generated over and over again
>
>                 Key: NUTCH-1245
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1245
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.5
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.7
>
>         Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch,
>                      NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed is
> fetched over and over again: although its fetch status is fetch_gone, its
> status in the CrawlDb stays db_unfetched. Consequently, this document is
> generated and fetched in every cycle from now on.
> To reproduce:
> # create a CrawlDatum in the CrawlDb whose retry interval hits
> db.fetch.interval.max (I manipulated shouldFetch() in AbstractFetchSchedule
> to achieve this)
> # now this URL is fetched again
> # but when the CrawlDb is updated with the fetch_gone, the CrawlDatum is
> reset to db_unfetched and the retry interval is fixed to
> 0.9 * db.fetch.interval.max (81 days)
> # this does not change in subsequent generate-fetch-update cycles, shown here
> for two segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
>
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in
> AbstractFetchSchedule.
> Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>   datum.fetchInterval = 1.5 * datum.fetchInterval  // now 1.5 * 0.9 * maxInterval
>   datum.fetchTime = fetchTime + datum.fetchInterval  // see NUTCH-516
>   if (maxInterval < datum.fetchInterval)  // necessarily true
>     forceRefetch()
>
> forceRefetch:
>   if (datum.fetchInterval > maxInterval)  // true because it's 1.35 * maxInterval
>     datum.fetchInterval = 0.9 * maxInterval
>   datum.status = db_unfetched
>
> shouldFetch (called from generate / Generator.map):
>   if ((datum.fetchTime - curTime) > maxInterval)
>     // always true if the crawler is launched in short intervals
>     // (shorter than 0.35 * maxInterval)
>     datum.fetchTime = curTime  // forces a refetch
> {code}
> After setPageGoneSchedule is called via update, the state is db_unfetched and
> the retry interval is 0.9 * db.fetch.interval.max (81 days).
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than
> db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35
> times db.fetch.interval.max, and the fetch time is always close to
> current time + 1.35 * db.fetch.interval.max.
> This is possibly a side effect of NUTCH-516 and may be related to NUTCH-578.
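To make the oscillation concrete, here is a small standalone simulation of the schedule arithmetic in the pseudo-code above. It is an illustrative sketch only: the 90-day maximum and the 1.5/0.9 factors come from the issue description, while the class and variable names are invented for the example.

{code}
/** Hypothetical simulation of the fetchInterval oscillation described above.
 *  Only the 1.5 backoff and 0.9 clamp factors come from the pseudo-code of
 *  AbstractFetchSchedule / forceRefetch; everything else is illustrative. */
public class GoneScheduleOscillation {
  // db.fetch.interval.max in seconds (90 days, per the issue description)
  static final long MAX_INTERVAL = 90L * 24 * 3600;

  public static void main(String[] args) {
    long fetchInterval = (long) (0.9f * MAX_INTERVAL); // state after a first forceRefetch
    for (int cycle = 1; cycle <= 3; cycle++) {
      // updatedb: setPageGoneSchedule backs the interval off by factor 1.5
      fetchInterval = (long) (1.5f * fetchInterval);   // -> 1.35 * maxInterval
      System.out.printf("cycle %d after backoff:      %.2f * maxInterval%n",
          cycle, (double) fetchInterval / MAX_INTERVAL);
      if (MAX_INTERVAL < fetchInterval) {
        // forceRefetch: clamp to 0.9 * maxInterval and reset to db_unfetched
        fetchInterval = (long) (0.9f * MAX_INTERVAL);
      }
      System.out.printf("cycle %d after forceRefetch: %.2f * maxInterval%n",
          cycle, (double) fetchInterval / MAX_INTERVAL);
      // generate: fetchTime was set ~1.35 * maxInterval ahead of curTime, so
      // shouldFetch sees (fetchTime - curTime) > maxInterval and forces the
      // URL into the next fetch list once more.
    }
  }
}
{code}

Every pass prints 1.35 followed by 0.90, matching the oscillation visible in the CrawlDb dump above.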
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1245:
----------------------------------------
    Patch Info: Patch Available
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
-----------------------------------
    Attachment: NUTCH-1245-2.patch
                NUTCH-1245-578-TEST-2.patch

Improved patches
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
-----------------------------------
    Attachment: NUTCH-1245-1.patch

FetchSchedule.setPageGoneSchedule is called exclusively for a fetch_gone in
CrawlDbReducer.reduce. Is there any need to call forceRefetch right after a
fetch leads to a fetch_gone (assuming there is little delay between fetch and
updatedb)? The attached patch sets the fetchInterval to db.fetch.interval.max
and does not call forceRefetch.
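A minimal sketch of that approach, reconstructed from the comment above and not the contents of NUTCH-1245-1.patch itself: the method signature and CrawlDatum accessors follow AbstractFetchSchedule in Nutch 1.x, the class name is invented, and the real patch may differ in detail.

{code}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

/** Sketch of the approach described above; NOT the attached patch. */
public class CappedGoneFetchSchedule extends AbstractFetchSchedule {
  @Override
  public CrawlDatum setPageGoneSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime) {
    // back off by 50% as before, but never beyond db.fetch.interval.max;
    // forceRefetch (which resets the status to db_unfetched) is not called
    if (datum.getFetchInterval() * 1.5f < maxInterval) {
      datum.setFetchInterval(datum.getFetchInterval() * 1.5f);
    } else {
      datum.setFetchInterval(maxInterval); // pinned at the maximum
    }
    datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000L);
    return datum;
  }
}
{code}

With the interval capped at maxInterval, the fetch time stays within maxInterval of the current time, so the shouldFetch branch that forces a refetch can no longer fire on every generate.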
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
-----------------------------------
    Attachment: NUTCH-1245-578-TEST-1.patch

JUnit test to catch this problem and NUTCH-578: a large patch for a test, but
the idea is to extend it to also cover other transitions of CrawlDatum states.
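The attached test is not reproduced here; the following is a hypothetical, much smaller JUnit sketch of the kind of state-transition check involved. The FetchSchedule, FetchScheduleFactory, and CrawlDatum APIs are real Nutch 1.x classes; the test class name, the chosen start state, and the assertions are illustrative assumptions.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.FetchSchedule;
import org.apache.nutch.crawl.FetchScheduleFactory;
import org.apache.nutch.util.NutchConfiguration;
import org.junit.Assert;
import org.junit.Test;

/** Hypothetical sketch of a state-transition test; NOT the attached patch. */
public class TestPageGoneScheduleSketch {

  @Test
  public void goneScheduleMustNotOscillate() {
    Configuration conf = NutchConfiguration.create();
    int maxInterval = conf.getInt("db.fetch.interval.max", 7776000); // 90 days
    FetchSchedule schedule = FetchScheduleFactory.getFetchSchedule(conf);

    // a page already marked gone, with the retry interval near the maximum
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_GONE,
        (int) (0.9f * maxInterval));
    long now = System.currentTimeMillis();

    // simulate repeated updatedb runs after further fetch_gone results
    for (int cycle = 0; cycle < 5; cycle++) {
      schedule.setPageGoneSchedule(new Text("http://localhost/page_gone"),
          datum, now, now, now);
      // buggy schedule: forceRefetch resets the status to db_unfetched
      Assert.assertEquals("status must stay db_gone",
          CrawlDatum.STATUS_DB_GONE, datum.getStatus());
      // buggy schedule: fetch time lands ~1.35 * maxInterval in the future,
      // which makes the generator force a refetch on every cycle
      Assert.assertTrue("fetch time must stay within db.fetch.interval.max",
          datum.getFetchTime() - now <= (long) maxInterval * 1000L);
    }
  }
}
{code}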
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
---------------------------------
    Fix Version/s:     (was: 1.5)
                       1.6
                       20120304-push-1.6
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
---------------------------------
    Priority: Critical  (was: Major)
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
---------------------------------
    Fix Version/s: 1.5