Andrzej Bialecki wrote:
Currently, if a page in CrawlDB ends up in STATUS_DB_GONE, it will stay
in that state forever. But it's easy to imagine a scenario where a page
is temporarily inaccessible (e.g. the server is down for an extended
period and then restored, or someone made a mistake and linked to
/media/year2006/April) and later becomes available again.
As it is now, Generator will always skip pages with this status, so they
have no chance of ever being revisited. I propose never to treat such
pages as truly gone, but instead to significantly increase their
re-fetch interval. This way we will eventually be able to check whether
the page is back - if not, we increase the interval again; if it is, we
have a chance to reset its status to DB_FETCHED.
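
For concreteness, here is a minimal sketch of the idea. The class,
field, and constant names are illustrative, only loosely modeled on
CrawlDatum, and the backoff factor and caps are assumptions, not a
proposed patch:

  import java.util.concurrent.TimeUnit;

  public class GoneBackoffSketch {

    // Assumptions: a 30-day default interval, a one-year cap, and
    // quadrupling the interval on each failed retry.
    static final int DEFAULT_INTERVAL = (int) TimeUnit.DAYS.toSeconds(30);
    static final int MAX_INTERVAL     = (int) TimeUnit.DAYS.toSeconds(365);
    static final float BACKOFF        = 4.0f;

    enum Status { DB_FETCHED, DB_GONE }

    static class Page {
      Status status;
      int fetchIntervalSecs;
    }

    /** Apply the proposal after a re-fetch attempt of a GONE page. */
    static void update(Page page, boolean fetchSucceeded) {
      if (fetchSucceeded) {
        // The page is back: reset status and interval instead of
        // leaving it permanently marked as gone.
        page.status = Status.DB_FETCHED;
        page.fetchIntervalSecs = DEFAULT_INTERVAL;
      } else {
        // Still gone: keep the status but stretch the interval, so
        // Generator retries it eventually, ever more rarely.
        page.status = Status.DB_GONE;
        page.fetchIntervalSecs = (int) Math.min(
            (long) (page.fetchIntervalSecs * BACKOFF), MAX_INTERVAL);
      }
    }
  }

The capped exponential backoff keeps genuinely dead pages from eating
much fetch bandwidth, while still giving restored pages a way back in.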
That sounds reasonable to me. +1
Doug