i have tested this now with the current trunk of nutch.
Revision: 886112

the dump of the crawl db shows:

http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Dec 02 12:48:22 CET 2009
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0833334
Signature: db9ab2193924cd2d0b53113a500ca604
Metadata: _pst_: success(1), lastModified=0

http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
Version: 7
Status: 2 (db_fetched)
Fetch time: Sun Jan 31 12:44:52 CET 2010
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 1.0166667
Signature: c409d31ddf24f01b262c19ac2e301671
Metadata: _pst_: success(1), lastModified=0 _repr_: http://www.wachauclimbing.net/home/impressum-disclaimer/feed

the other crawl datums have a retry interval of 60 days.

this crawl datum will be fetched again and again with a retry interval
of 0 days.
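for reference, the arithmetic behind the endless refetch: the schedule computes the next fetch time as fetchTime + fetchInterval * 1000, so an interval of 0 leaves the url due again immediately and the generator selects it on every cycle. a standalone sketch of that arithmetic (not the real nutch classes):

```java
// Standalone sketch of the next-fetch-time arithmetic used by the
// fetch schedule (not the actual Nutch DefaultFetchSchedule class).
public class FetchIntervalSketch {

    // Next fetch time = last fetch time + interval (seconds) * 1000.
    static long nextFetchTime(long fetchTimeMillis, int fetchIntervalSecs) {
        return fetchTimeMillis + (long) fetchIntervalSecs * 1000L;
    }

    public static void main(String[] args) {
        long now = 1259750000000L; // some fetch time in millis

        // Healthy datum: a 60-day interval pushes the next fetch 60 days out.
        long in60Days = nextFetchTime(now, 60 * 24 * 3600);
        System.out.println("60d interval, delay in ms: " + (in60Days - now));

        // Broken datum: interval 0 means the URL is due again immediately,
        // so the generator keeps re-selecting it.
        long again = nextFetchTime(now, 0);
        System.out.println("0 interval, due again immediately: " + (again == now));
    }
}
```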

i will open an issue in jira and attach a patch.
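the actual patch will be the one attached to the jira issue; purely as an illustration, a guard in the spirit of the proposal below could fall back to the configured default interval (db.fetch.interval.default, 30 days out of the box) whenever it sees 0. class and method names here are hypothetical:

```java
// Hypothetical guard against a zero fetch interval, illustrating the
// kind of check proposed for DefaultFetchSchedule/AbstractFetchSchedule.
// The real fix is the patch attached to the JIRA issue.
public class FetchIntervalGuard {

    // Stand-in for db.fetch.interval.default (30 days in nutch-default.xml).
    static final int DEFAULT_INTERVAL = 30 * 24 * 3600;

    // Return a sane interval: fall back to the default when the datum
    // carries the bogus 0 written at parse time.
    static int sanitizeInterval(int fetchInterval) {
        return fetchInterval <= 0 ? DEFAULT_INTERVAL : fetchInterval;
    }

    public static void main(String[] args) {
        System.out.println(sanitizeInterval(0));       // falls back to default
        System.out.println(sanitizeInterval(5184000)); // 60 days, kept as-is
    }
}
```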

regards
reinhard


reinhard schwab wrote:
> i'm observing crawl datums which have a fetch interval of 0.
> when i dump the segment, i see
>
> Recno:: 33
> URL::
> http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
>
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Dec 01 23:41:15 CET 2009
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 1.0
> Signature: 1d63c4283a5e0c7b8eb8dee359adfabe
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Dec 01 23:38:48 CET 2009
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 1.0
> Signature: null
> Metadata:
>
>
> this crawl datum is parsed/generated from a feed:
> http://www.wachauclimbing.net/home/impressum-disclaimer/feed
>
> when and where should the fetch interval be set?
> when parsing or when updating the crawl db?
>
> this is the code in ParseOutputFormat that i suspect generates the crawl datum:
>
>     if (!parse.isCanonical()) {
>       // a fresh CrawlDatum: its fetch interval is never set here and stays 0
>       CrawlDatum datum = new CrawlDatum();
>       datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
>       String timeString = parse.getData().getContentMeta().get(
>           Nutch.FETCH_TIME_KEY);
>       try {
>         datum.setFetchTime(Long.parseLong(timeString));
>       } catch (Exception e) {
>         LOG.warn("Can't read fetch time for: " + key);
>         datum.setFetchTime(System.currentTimeMillis());
>       }
>       crawlOut.append(key, datum);
>     }
>
> i assume the fetch interval should be set in CrawlDbReducer:
>
>     // set the schedule
>     result = schedule.setFetchSchedule((Text) key, result,
>         prevFetchTime, prevModifiedTime, fetch.getFetchTime(),
>         fetch.getModifiedTime(), modified);
>     if (result.getFetchInterval() == 0) {
>       LOG.warn("WARNING: FETCH INTERVAL is 0 for " + key);
>     }
>
> here i observe the 0.
>
> i propose to check for a 0 fetch interval in DefaultFetchSchedule or in
> AbstractFetchSchedule.
>
> public class DefaultFetchSchedule extends AbstractFetchSchedule {
>
>   @Override
>   public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
>       long prevFetchTime, long prevModifiedTime,
>       long fetchTime, long modifiedTime, int state) {
>     datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
>         fetchTime, modifiedTime, state);
>     datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
>     datum.setModifiedTime(modifiedTime);
>     return datum;
>   }
> }
>
> regards
> reinhard
>
