i'm observing crawl datums (CrawlDatum entries) whose fetch interval is 0.
when i dump the segment, i see:

Recno:: 33
URL:: http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/

CrawlDatum::
Version: 7
Status: 65 (signature)
Fetch time: Tue Dec 01 23:41:15 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 1d63c4283a5e0c7b8eb8dee359adfabe
Metadata:

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Tue Dec 01 23:38:48 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: null
Metadata:


this crawl datum was generated while parsing a feed:
http://www.wachauclimbing.net/home/impressum-disclaimer/feed

when and where should the fetch interval be set?
when parsing or when updating the crawl db?

this is the code in ParseOutputFormat that i suspect generates the crawl datum:

    if (!parse.isCanonical()) {
      CrawlDatum datum = new CrawlDatum();
      datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
      String timeString =
          parse.getData().getContentMeta().get(Nutch.FETCH_TIME_KEY);
      try {
        datum.setFetchTime(Long.parseLong(timeString));
      } catch (Exception e) {
        LOG.warn("Can't read fetch time for: " + key);
        datum.setFetchTime(System.currentTimeMillis());
      }
      crawlOut.append(key, datum);
    }

i assume the fetch interval should be set in CrawlDbReducer:

    // set the schedule
    result = schedule.setFetchSchedule((Text) key, result,
        prevFetchTime, prevModifiedTime, fetch.getFetchTime(),
        fetch.getModifiedTime(), modified);
    if (result.getFetchInterval() == 0) {
      LOG.warn("WARNING: FETCH INTERVAL is 0 for " + key);
    }

here i observe the 0.

i propose checking for a fetch interval of 0 in DefaultFetchSchedule or in
AbstractFetchSchedule.

public class DefaultFetchSchedule extends AbstractFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
          long prevFetchTime, long prevModifiedTime,
          long fetchTime, long modifiedTime, int state) {
    datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
    datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
    datum.setModifiedTime(modifiedTime);
    return datum;
  }
}
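
a minimal standalone sketch of the check i have in mind. it does not use the nutch classes: `Datum` is a hypothetical stand-in for CrawlDatum, `FetchIntervalGuardSketch` is a made-up name, and the 30-day fallback is an assumption (in nutch the value would presumably come from the configured default interval rather than a constant).

```java
// Standalone sketch, not Nutch code. Datum is a hypothetical stand-in for
// CrawlDatum, reduced to the two fields that matter for this problem.
class FetchIntervalGuardSketch {

    /** Hypothetical stand-in for CrawlDatum. */
    static final class Datum {
        long fetchTime;     // milliseconds since the epoch
        int fetchInterval;  // seconds; an unset int field is 0
    }

    /** Assumed fallback of 30 days, expressed in seconds. */
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    /**
     * Same arithmetic as the setFetchTime(...) line quoted above,
     * with the proposed guard placed in front of it.
     */
    static Datum setFetchSchedule(Datum datum, long fetchTime) {
        if (datum.fetchInterval <= 0) {
            // Proposed check: never schedule with a zero or negative interval,
            // otherwise the next fetch time equals the current fetch time.
            datum.fetchInterval = DEFAULT_INTERVAL_SECONDS;
        }
        datum.fetchTime = fetchTime + (long) datum.fetchInterval * 1000;
        return datum;
    }

    public static void main(String[] args) {
        long fetched = 1_000_000L; // arbitrary fetch time in ms

        Datum fromFeed = new Datum(); // interval never set, so it is 0
        setFetchSchedule(fromFeed, fetched);
        System.out.println("interval after guard: " + fromFeed.fetchInterval + " s");
        System.out.println("next fetch time:      " + fromFeed.fetchTime + " ms");
    }
}
```

without the guard, a datum with interval 0 would get a next fetch time equal to its fetch time, so it would be due again immediately. a datum that already carries a positive interval passes through unchanged.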

regards
reinhard

