I'm seeing CrawlDatum entries whose fetch interval is 0. When I dump the segment, I see:

    Recno:: 33
    URL:: http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/

    CrawlDatum::
    Version: 7
    Status: 65 (signature)
    Fetch time: Tue Dec 01 23:41:15 CET 2009
    Modified time: Thu Jan 01 01:00:00 CET 1970
    Retries since fetch: 0
    Retry interval: 0 seconds (0 days)
    Score: 1.0
    Signature: 1d63c4283a5e0c7b8eb8dee359adfabe
    Metadata:

    CrawlDatum::
    Version: 7
    Status: 33 (fetch_success)
    Fetch time: Tue Dec 01 23:38:48 CET 2009
    Modified time: Thu Jan 01 01:00:00 CET 1970
    Retries since fetch: 0
    Retry interval: 0 seconds (0 days)
    Score: 1.0
    Signature: null
    Metadata:

This CrawlDatum was generated while parsing a feed:
http://www.wachauclimbing.net/home/impressum-disclaimer/feed

When and where should the fetch interval be set? When parsing, or when updating the crawl db?

This is the code in ParseOutputFormat that I suspect generates the CrawlDatum:

    if (!parse.isCanonical()) {
      CrawlDatum datum = new CrawlDatum();
      datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
      String timeString = parse.getData().getContentMeta().get(
          Nutch.FETCH_TIME_KEY);
      try {
        datum.setFetchTime(Long.parseLong(timeString));
      } catch (Exception e) {
        LOG.warn("Can't read fetch time for: " + key);
        datum.setFetchTime(System.currentTimeMillis());
      }
      crawlOut.append(key, datum);
    }

Note that it never sets a fetch interval. I assume the fetch interval should be set in CrawlDbReducer, where I added a warning after the schedule call:

    // set the schedule
    result = schedule.setFetchSchedule((Text) key, result, prevFetchTime,
        prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(),
        modified);
    if (result.getFetchInterval() == 0) {
      LOG.warn("WARNING: FETCH INTERVAL is 0 for " + key);
    }

This is where I observe the 0. I propose to check for a fetch interval of 0 in DefaultFetchSchedule or in AbstractFetchSchedule.
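
To make the proposal concrete, here is a minimal sketch of the kind of check I have in mind. It deliberately does not use the real Nutch classes: `Datum`, `schedule` and `DEFAULT_INTERVAL_SECONDS` are made-up stand-ins, and the 30-day fallback is only an assumption about what a sensible default would be. In Nutch itself the value would presumably come from the configuration.

```java
public class FetchIntervalGuard {
    // Assumed fallback of 30 days, in seconds (hypothetical value for this sketch).
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    // Hypothetical stand-in for CrawlDatum, holding only the two fields used here.
    static class Datum {
        int fetchIntervalSeconds;   // 0 until someone sets it
        long fetchTimeMillis;
    }

    // If the interval was never set (0 or negative), fall back to the default,
    // then compute the next fetch time as fetch time plus interval.
    static Datum schedule(Datum datum, long fetchTimeMillis) {
        if (datum.fetchIntervalSeconds <= 0) {
            datum.fetchIntervalSeconds = DEFAULT_INTERVAL_SECONDS;
        }
        datum.fetchTimeMillis =
            fetchTimeMillis + (long) datum.fetchIntervalSeconds * 1000;
        return datum;
    }

    public static void main(String[] args) {
        Datum d = new Datum();   // interval is 0, like the entries in the dump
        schedule(d, System.currentTimeMillis());
        System.out.println("interval: " + d.fetchIntervalSeconds
            + " s, next fetch: " + d.fetchTimeMillis);
    }
}
```

Without such a guard, an interval of 0 makes the next fetch time equal to the current fetch time, so the URL is due for fetching again immediately.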

For reference, this is the current DefaultFetchSchedule:

    public class DefaultFetchSchedule extends AbstractFetchSchedule {

      @Override
      public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
          long prevFetchTime, long prevModifiedTime,
          long fetchTime, long modifiedTime, int state) {
        datum = super.setFetchSchedule(url, datum, prevFetchTime,
            prevModifiedTime, fetchTime, modifiedTime, state);
        // with a fetch interval of 0, the next fetch time equals fetchTime
        datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000);
        datum.setModifiedTime(modifiedTime);
        return datum;
      }
    }

Regards,
reinhard
