Andrzej Bialecki wrote:
Currently, if a page in CrawlDB ends up in STATUS_DB_GONE, it will stay in that state forever. But it's easy to imagine a scenario where a page is inaccessible for a while and then becomes available again (e.g. the server is down for a long period and then restored, or someone makes a mistake in a link like /media/year2006/April and later fixes it).

As it stands, the Generator will always skip pages with this status, so they never have a chance of being revisited. I propose never treating such pages as truly gone, but instead significantly increasing their re-fetch interval. That way we will eventually check whether the page is back - if it isn't, we increase the interval again; if it is, we have a chance to reset the status to DB_FETCHED.
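
To make the idea concrete, here is a minimal sketch of the proposed back-off logic in Java. Everything in it (CrawlEntry, updateAfterFetch, the interval constants) is a hypothetical illustration of the scheme described above, not actual Nutch API:

  // Hypothetical sketch of the "never truly gone" back-off proposal.
  // None of these names come from Nutch itself.
  public class GoneBackoffSketch {

    enum PageStatus { DB_FETCHED, DB_GONE }

    static final long DEFAULT_INTERVAL = 30L * 24 * 3600;   // 30 days, in seconds
    static final long MAX_INTERVAL     = 365L * 24 * 3600;  // cap retries at one year
    static final float BACKOFF_FACTOR  = 2.0f;              // how fast the interval grows

    static class CrawlEntry {
      PageStatus status = PageStatus.DB_FETCHED;
      long fetchInterval = DEFAULT_INTERVAL;  // seconds until the next fetch attempt
    }

    // Update an entry after a fetch attempt, instead of dropping GONE pages.
    static void updateAfterFetch(CrawlEntry entry, boolean fetchSucceeded) {
      if (fetchSucceeded) {
        // The page is back: reset the status and the interval.
        entry.status = PageStatus.DB_FETCHED;
        entry.fetchInterval = DEFAULT_INTERVAL;
      } else {
        // Still gone: keep the entry, but back off the re-fetch interval
        // so the page is retried ever more rarely rather than never.
        entry.status = PageStatus.DB_GONE;
        entry.fetchInterval = Math.min(
            (long) (entry.fetchInterval * BACKOFF_FACTOR), MAX_INTERVAL);
      }
    }
  }

With this shape, the Generator would select a DB_GONE page once its (ever-growing) interval expires, rather than skipping it unconditionally.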

That sounds reasonable to me.  +1

Doug
