Re: [nutchgora] AbstractFetchSchedule.forceFetch method resets fetch status

Mathijs Homminga Tue, 28 Feb 2012 07:02:17 -0800

Yes, thanks.
It is related. However, it applies not only to DB_GONE pages but to all
pages whose fetchInterval exceeds the max interval.


Actually, I'm still a bit puzzled by the scheduling-related parameters and the 
way the AbstractFetchSchedule handles them.
Why do pages with a fetchInterval > maxInterval suddenly have to be fetched?
I would say that if we encounter such pages, we should correct the fetchInterval 
(set it to the maxInterval) and leave it there. Also, I would suggest that we 
only do this at DbUpdate time.
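
A minimal sketch of that suggestion, for illustration only. PageSketch and
IntervalClampSketch are hypothetical stand-ins, not the actual Nutch WebPage
or AbstractFetchSchedule classes, and the status strings are plain
placeholders. The idea is to cap the interval and leave the status alone:

```java
// Hypothetical stand-in for a webtable row; NOT the real Nutch WebPage class.
final class PageSketch {
    long fetchInterval; // seconds
    String status;      // placeholder, e.g. "STATUS_FETCHED" or "STATUS_UNFETCHED"

    PageSketch(long fetchInterval, String status) {
        this.fetchInterval = fetchInterval;
        this.status = status;
    }
}

// Hypothetical helper; the name and signature are assumptions for this sketch.
final class IntervalClampSketch {
    // Cap the page's fetchInterval at maxInterval without touching its status.
    // Returns true if the interval had to be corrected.
    static boolean clampInterval(PageSketch page, long maxInterval) {
        if (page.fetchInterval > maxInterval) {
            page.fetchInterval = maxInterval;
            return true;
        }
        return false;
    }
}
```

Called once per page at DbUpdate time, this would keep every fetchInterval
within the current maximum while a fetched page stays marked as fetched.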

Mathijs

On Feb 28, 2012, at 14:41 , Markus Jelsma wrote:

> https://issues.apache.org/jira/browse/NUTCH-578
> https://issues.apache.org/jira/browse/NUTCH-1245
> 
> Is your issue similar to these?
> 
> On Tuesday 28 February 2012 14:09:25 Mathijs Homminga wrote:
>> Hi,
>> 
>> Does anyone know why the AbstractFetchSchedule.forceFetch method sets the
>> page.status to STATUS_UNFETCHED?
>> 
>> The DbUpdateReducer calls this method when the page.fetchInterval exceeds
>> the (current) db.fetch.interval.max. As I understand it, we call this
>> method to keep all fetchIntervals in the webtable within the current
>> maximum, but why reset the page status?
>> 
>> I bumped into this because my db.fetch.interval.default >
>> db.fetch.interval.max ;)) After a couple of successful crawl cycles, all
>> of my webpages still were STATUS_UNFETCHED.
>> 
>> Cheers,
>> Mathijs
> 
> -- 
> Markus Jelsma - CTO - Openindex
