[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
---------------------------------
    Fix Version/s:     (was: 1.8)
                       1.7

> URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb
> and is generated over and over again
>
>                 Key: NUTCH-1245
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1245
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.5
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.7
>
>         Attachments: NUTCH-1245-1.patch, NUTCH-1245-2.patch,
>                      NUTCH-1245-578-TEST-1.patch, NUTCH-1245-578-TEST-2.patch
>
> A document gone with 404 after db.fetch.interval.max (90 days) has passed is
> fetched over and over again: although its fetch status is fetch_gone, its
> status in the CrawlDb stays db_unfetched. Consequently, this document is
> generated and fetched in every cycle from now on.
> To reproduce:
> # create a CrawlDatum in the CrawlDb whose retry interval hits
> db.fetch.interval.max (I manipulated shouldFetch() in AbstractFetchSchedule
> to achieve this)
> # now this URL is fetched again
> # but when the CrawlDb is updated with the fetch_gone, the CrawlDatum is
> reset to db_unfetched and the retry interval is fixed to
> 0.9 * db.fetch.interval.max (81 days)
> # this does not change in subsequent generate-fetch-update cycles, shown here
> for two segments:
> {noformat}
> /tmp/testcrawl/segments/20120105161430
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:14:21 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:14:48 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776461784 _pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
>
> /tmp/testcrawl/segments/20120105161631
> SegmentReader: get 'http://localhost/page_gone'
> Crawl Generate::
> Status: 1 (db_unfetched)
> Fetch time: Thu Jan 05 16:16:23 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> Crawl Fetch::
> Status: 37 (fetch_gone)
> Fetch time: Thu Jan 05 16:20:05 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Metadata: _ngt_: 1325776583451 _pst_: notfound(14), lastModified=0:
> http://localhost/page_gone
> {noformat}
> As far as I can see it's caused by setPageGoneSchedule() in
> AbstractFetchSchedule.
> Some pseudo-code:
> {code}
> setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
>   datum.fetchInterval = 1.5 * datum.fetchInterval  // now 1.5 * 0.9 * maxInterval
>   datum.fetchTime = fetchTime + datum.fetchInterval  // see NUTCH-516
>   if (maxInterval < datum.fetchInterval)  // necessarily true
>     forceRefetch()
>
> forceRefetch:
>   if (datum.fetchInterval > maxInterval)  // true because it's 1.35 * maxInterval
>     datum.fetchInterval = 0.9 * maxInterval
>   datum.status = db_unfetched
>
> shouldFetch (called from generate / Generator.map):
>   if ((datum.fetchTime - curTime) > maxInterval)
>     // always true if the crawler is launched in short intervals
>     // (shorter than 0.35 * maxInterval)
>     datum.fetchTime = curTime  // forces a refetch
> {code}
> After setPageGoneSchedule is called via update, the state is db_unfetched and
> the retry interval is 0.9 * db.fetch.interval.max (81 days).
> Although the fetch time in the CrawlDb is far in the future
> {noformat}
> % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
> URL: http://localhost/page_gone
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Sun May 06 05:20:05 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 6998400 seconds (81 days)
> Score: 1.0
> Signature: null
> Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
> {noformat}
> the URL is generated again because (fetch time - current time) is larger than
> db.fetch.interval.max.
> The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35
> times db.fetch.interval.max, and the fetch time is always close to
> current time + 1.35 * db.fetch.interval.max.
> This is possibly a side effect of NUTCH-516 and may be related to NUTCH-578.
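To make the oscillation concrete, here is a small standalone simulation of the schedule arithmetic in the pseudo-code above. It is an illustrative sketch only: the 90-day maximum and the 1.5/0.9 factors come from the issue description, while the class and variable names are invented for the example.

{code}
/** Hypothetical simulation of the fetchInterval oscillation described above.
 *  Only the 1.5 backoff and 0.9 clamp factors come from the pseudo-code of
 *  AbstractFetchSchedule / forceRefetch; everything else is illustrative. */
public class GoneScheduleOscillation {
  // db.fetch.interval.max in seconds (90 days, per the issue description)
  static final long MAX_INTERVAL = 90L * 24 * 3600;

  public static void main(String[] args) {
    long fetchInterval = (long) (0.9f * MAX_INTERVAL); // state after a first forceRefetch
    for (int cycle = 1; cycle <= 3; cycle++) {
      // updatedb: setPageGoneSchedule backs the interval off by factor 1.5
      fetchInterval = (long) (1.5f * fetchInterval);   // -> 1.35 * maxInterval
      System.out.printf("cycle %d after backoff:      %.2f * maxInterval%n",
          cycle, (double) fetchInterval / MAX_INTERVAL);
      if (MAX_INTERVAL < fetchInterval) {
        // forceRefetch: clamp to 0.9 * maxInterval and reset to db_unfetched
        fetchInterval = (long) (0.9f * MAX_INTERVAL);
      }
      System.out.printf("cycle %d after forceRefetch: %.2f * maxInterval%n",
          cycle, (double) fetchInterval / MAX_INTERVAL);
      // generate: fetchTime was set ~1.35 * maxInterval ahead of curTime, so
      // shouldFetch sees (fetchTime - curTime) > maxInterval and forces the
      // URL into the next fetch list once more.
    }
  }
}
{code}

Every pass prints 1.35 followed by 0.90, matching the oscillation visible in the CrawlDb dump above.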
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1245:
----------------------------------------
    Patch Info: Patch Available
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
-----------------------------------
    Attachment: NUTCH-1245-2.patch
                NUTCH-1245-578-TEST-2.patch

Improved patches
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
-----------------------------------
    Attachment: NUTCH-1245-1.patch

FetchSchedule.setPageGoneSchedule is called exclusively for a fetch_gone in
CrawlDbReducer.reduce. Is there any need to call forceRefetch right after a
fetch leads to a fetch_gone (assuming there is little delay between fetch and
updatedb)? The attached patch sets the fetchInterval to db.fetch.interval.max
and does not call forceRefetch.
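A minimal sketch of that approach, reconstructed from the comment above and not the contents of NUTCH-1245-1.patch itself: the method signature and CrawlDatum accessors follow AbstractFetchSchedule in Nutch 1.x, the class name is invented, and the real patch may differ in detail.

{code}
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.AbstractFetchSchedule;
import org.apache.nutch.crawl.CrawlDatum;

/** Sketch of the approach described above; NOT the attached patch. */
public class CappedGoneFetchSchedule extends AbstractFetchSchedule {
  @Override
  public CrawlDatum setPageGoneSchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime) {
    // back off by 50% as before, but never beyond db.fetch.interval.max;
    // forceRefetch (which resets the status to db_unfetched) is not called
    if (datum.getFetchInterval() * 1.5f < maxInterval) {
      datum.setFetchInterval(datum.getFetchInterval() * 1.5f);
    } else {
      datum.setFetchInterval(maxInterval); // pinned at the maximum
    }
    datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000L);
    return datum;
  }
}
{code}

With the interval capped at maxInterval, the fetch time stays within maxInterval of the current time, so the shouldFetch branch that forces a refetch can no longer fire on every generate.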
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1245:
-----------------------------------
    Attachment: NUTCH-1245-578-TEST-1.patch

JUnit test to catch this problem and NUTCH-578: a large patch for a test, but
the idea is to extend it to also cover other transitions of CrawlDatum states.
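The attached test is not reproduced here; the following is a hypothetical, much smaller JUnit sketch of the kind of state-transition check involved. The FetchSchedule, FetchScheduleFactory, and CrawlDatum APIs are real Nutch 1.x classes; the test class name, the chosen start state, and the assertions are illustrative assumptions.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.FetchSchedule;
import org.apache.nutch.crawl.FetchScheduleFactory;
import org.apache.nutch.util.NutchConfiguration;
import org.junit.Assert;
import org.junit.Test;

/** Hypothetical sketch of a state-transition test; NOT the attached patch. */
public class TestPageGoneScheduleSketch {

  @Test
  public void goneScheduleMustNotOscillate() {
    Configuration conf = NutchConfiguration.create();
    int maxInterval = conf.getInt("db.fetch.interval.max", 7776000); // 90 days
    FetchSchedule schedule = FetchScheduleFactory.getFetchSchedule(conf);

    // a page already marked gone, with the retry interval near the maximum
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_GONE,
        (int) (0.9f * maxInterval));
    long now = System.currentTimeMillis();

    // simulate repeated updatedb runs after further fetch_gone results
    for (int cycle = 0; cycle < 5; cycle++) {
      schedule.setPageGoneSchedule(new Text("http://localhost/page_gone"),
          datum, now, now, now);
      // buggy schedule: forceRefetch resets the status to db_unfetched
      Assert.assertEquals("status must stay db_gone",
          CrawlDatum.STATUS_DB_GONE, datum.getStatus());
      // buggy schedule: fetch time lands ~1.35 * maxInterval in the future,
      // which makes the generator force a refetch on every cycle
      Assert.assertTrue("fetch time must stay within db.fetch.interval.max",
          datum.getFetchTime() - now <= (long) maxInterval * 1000L);
    }
  }
}
{code}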
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
---------------------------------
    Fix Version/s:     (was: 1.5)
                       1.6
                       20120304-push-1.6
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
---------------------------------
    Priority: Critical  (was: Major)
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
---------------------------------
    Fix Version/s: 1.5