I have tested this now with the current trunk of Nutch (revision 886112). The dump of the crawl db shows:
http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Dec 02 12:48:22 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0833334
Signature: db9ab2193924cd2d0b53113a500ca604
Metadata: _pst_: success(1), lastModified=0

http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
Version: 7
Status: 2 (db_fetched)
Fetch time: Sun Jan 31 12:44:52 CET 2010
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 1.0166667
Signature: c409d31ddf24f01b262c19ac2e301671
Metadata: _pst_: success(1), lastModified=0
_repr_: http://www.wachauclimbing.net/home/impressum-disclaimer/feed

The other crawl datums have a 60-day retry interval. This crawl datum will be fetched again and again with a 0-day retry interval. I will open an issue in JIRA and attach a patch.

Regards,
Reinhard

reinhard schwab wrote:
> I'm observing crawl datums which have a fetch interval of 0.
> When I dump the segment, I see:
>
> Recno:: 33
> URL:: http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
>
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Dec 01 23:41:15 CET 2009
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 1.0
> Signature: 1d63c4283a5e0c7b8eb8dee359adfabe
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Dec 01 23:38:48 CET 2009
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 1.0
> Signature: null
> Metadata:
>
> This crawl datum is parsed/generated from a feed:
> http://www.wachauclimbing.net/home/impressum-disclaimer/feed
>
> When and where should the fetch interval be set?
> When parsing or when updating the crawl db?
>
> This is the code in ParseOutputFormat that I suspect generates the crawl datum:
>
> if (!parse.isCanonical()) {
>   CrawlDatum datum = new CrawlDatum();
>   datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
>   String timeString = parse.getData().getContentMeta().get(Nutch.FETCH_TIME_KEY);
>   try {
>     datum.setFetchTime(Long.parseLong(timeString));
>   } catch (Exception e) {
>     LOG.warn("Can't read fetch time for: " + key);
>     datum.setFetchTime(System.currentTimeMillis());
>   }
>   crawlOut.append(key, datum);
> }
>
> I assume the fetch interval should be set in CrawlDbReducer:
>
> // set the schedule
> result = schedule.setFetchSchedule((Text) key, result,
>     prevFetchTime, prevModifiedTime, fetch.getFetchTime(),
>     fetch.getModifiedTime(), modified);
> if (result.getFetchInterval() == 0) {
>   LOG.warn("WARNING: FETCH INTERVAL is 0 for " + key);
> }
>
> This is where I observe the 0.
>
> I propose to check for a 0 fetch interval in DefaultFetchSchedule or in AbstractFetchSchedule.
>
> public class DefaultFetchSchedule extends AbstractFetchSchedule {
>
>   @Override
>   public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
>       long prevFetchTime, long prevModifiedTime,
>       long fetchTime, long modifiedTime, int state) {
>     datum = super.setFetchSchedule(url, datum, prevFetchTime,
>         prevModifiedTime, fetchTime, modifiedTime, state);
>     datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000);
>     datum.setModifiedTime(modifiedTime);
>     return datum;
>   }
> }
>
> Regards,
> Reinhard
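To make the proposed check concrete, here is a standalone sketch of the guard logic. The class, constant, and method names below are mine for illustration and are not part of the Nutch API; the sketch only shows the idea of falling back to a default interval when a datum carries 0, so the next fetch time is never scheduled at the current fetch time.

```java
// Illustrative sketch only: FetchIntervalGuard, sanitizeInterval and
// DEFAULT_INTERVAL_SECONDS are hypothetical names, not Nutch code.
public class FetchIntervalGuard {

    // Assumed default, analogous to db.fetch.interval.default (30 days).
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    // Return the datum's own interval if it is sane, else the default.
    static int sanitizeInterval(int intervalSeconds) {
        return intervalSeconds > 0 ? intervalSeconds : DEFAULT_INTERVAL_SECONDS;
    }

    // Next fetch time in milliseconds, mirroring the
    // fetchTime + (long) interval * 1000 arithmetic in DefaultFetchSchedule.
    static long nextFetchTime(long fetchTimeMillis, int intervalSeconds) {
        return fetchTimeMillis + (long) sanitizeInterval(intervalSeconds) * 1000L;
    }

    public static void main(String[] args) {
        // Interval 0 (the buggy case) falls back to the default,
        // so the URL is not refetched immediately.
        System.out.println(nextFetchTime(0L, 0));
        // A sane 60-day interval (5184000 s) is kept unchanged.
        System.out.println(nextFetchTime(0L, 5184000));
    }
}
```

With such a guard in AbstractFetchSchedule (or DefaultFetchSchedule), a datum generated by ParseOutputFormat without an interval would still get a usable schedule in the crawl db update.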
