Andrzej Bialecki wrote:
Currently, if a page in CrawlDB ends up in STATUS_DB_GONE, it will stay in that state forever. But it's easy to imagine a scenario where a page is inaccessible for a while and then becomes available again (e.g. the server is down for a long period and then restored, or someone makes a mistake in a link like /media/year2006/April and later fixes it).

As it stands, the Generator will always skip pages with this status, so they never have a chance of being revisited. I propose never treating such pages as truly gone, but instead significantly increasing their re-fetch interval. That way we will eventually check whether the page is back - if it isn't, we increase the interval again; if it is, we have a chance to reset the status to DB_FETCHED.
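
To make the idea concrete, here is a minimal sketch of the proposed back-off logic in Java. Everything in it (CrawlEntry, updateAfterFetch, the interval constants) is a hypothetical illustration of the scheme described above, not actual Nutch API:

  // Hypothetical sketch of the "never truly gone" back-off proposal.
  // None of these names come from Nutch itself.
  public class GoneBackoffSketch {

    enum PageStatus { DB_FETCHED, DB_GONE }

    static final long DEFAULT_INTERVAL = 30L * 24 * 3600;   // 30 days, in seconds
    static final long MAX_INTERVAL     = 365L * 24 * 3600;  // cap retries at one year
    static final float BACKOFF_FACTOR  = 2.0f;              // how fast the interval grows

    static class CrawlEntry {
      PageStatus status = PageStatus.DB_FETCHED;
      long fetchInterval = DEFAULT_INTERVAL;  // seconds until the next fetch attempt
    }

    // Update an entry after a fetch attempt, instead of dropping GONE pages.
    static void updateAfterFetch(CrawlEntry entry, boolean fetchSucceeded) {
      if (fetchSucceeded) {
        // The page is back: reset the status and the interval.
        entry.status = PageStatus.DB_FETCHED;
        entry.fetchInterval = DEFAULT_INTERVAL;
      } else {
        // Still gone: keep the entry, but back off the re-fetch interval
        // so the page is retried ever more rarely rather than never.
        entry.status = PageStatus.DB_GONE;
        entry.fetchInterval = Math.min(
            (long) (entry.fetchInterval * BACKOFF_FACTOR), MAX_INTERVAL);
      }
    }
  }

With this shape, the Generator would select a DB_GONE page once its (ever-growing) interval expires, rather than skipping it unconditionally.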

That sounds reasonable to me.  +1

Doug
