+1

I see these issues on a daily basis with my directed crawl. Sometimes it's due to a DNS hiccup, sometimes a route is down, sometimes it's me upgrading our router right when the crawl is scheduled to run.

The most frustrating cases are pages that fetch perfectly every week for months on end and are then suddenly set to STATUS_DB_GONE because of a single transient failure. I find myself wishing that the webdb had some memory of past success, but I realize that this would be fairly expensive to implement in general.

I work around this at present by periodically resetting the crawl time on pages with status STATUS_DB_GONE, but that is a hideous hack. Andrzej's suggestion seems like an elegant solution to this problem.
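
For concreteness, the hack amounts to roughly the sketch below. This is illustrative Java only, not actual Nutch code: PageRecord, its fields, and the constant value are stand-ins for the real CrawlDb entries and CrawlDatum constants.

    import java.util.List;

    public class ResetGonePages {
        static final byte STATUS_DB_GONE = 3;     // illustrative constant, not Nutch's value

        static class PageRecord {                 // stand-in for a CrawlDb entry
            String url;
            byte status;
            long nextFetchTime;                   // millis since epoch
        }

        // Make GONE pages eligible for the next generate/fetch cycle again.
        static void resetCrawlTime(List<PageRecord> db, long now) {
            for (PageRecord p : db) {
                if (p.status == STATUS_DB_GONE) {
                    p.nextFetchTime = now;        // due immediately
                }
            }
        }
    }

The ugly part is that this treats every GONE page as worth retrying right away, regardless of how long it has actually been unreachable.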

--matt

On Feb 23, 2006, at 8:23 AM, Andrzej Bialecki wrote:

Hi,

I have some doubts about how the current code interprets this state.

Currently, if a page in CrawlDB ends up in STATUS_DB_GONE, it will stay in that state forever. But it's easy to imagine a scenario where a page is inaccessible for a while (e.g. the server is down for an extended period and then restored, or someone mistakenly linked to a /media/year2006/April), but after a while it becomes available again.

As it is now, the Generator will always skip pages with this status, so they have no chance of ever being revisited. I propose never treating such pages as truly gone, and instead significantly increasing their re-fetch interval. That way we will eventually check whether the page is back again: if not, we increase the interval again; if it is, we have a chance to reset the status to DB_FETCHED.
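
Roughly, the policy I have in mind looks like the sketch below. This is illustrative Java only, not the real CrawlDatum API; the class, constants, growth factor and interval numbers are just placeholders.

    public class GonePagePolicy {
        static final byte STATUS_DB_FETCHED = 2;      // illustrative constants
        static final byte STATUS_DB_GONE = 3;

        static final float BACKOFF = 2.0f;            // growth factor (an assumption)
        static final int DEFAULT_INTERVAL_DAYS = 30;  // normal re-fetch interval
        static final int MAX_INTERVAL_DAYS = 365;     // cap so the interval stays bounded

        static class PageRecord {                     // stand-in for a CrawlDb entry
            byte status;
            int fetchIntervalDays;
        }

        // Apply one fetch outcome: back off further on failure, recover on success.
        static void update(PageRecord p, boolean fetchSucceeded) {
            if (fetchSucceeded) {
                p.status = STATUS_DB_FETCHED;         // the page is back
                p.fetchIntervalDays = DEFAULT_INTERVAL_DAYS;
            } else {
                p.status = STATUS_DB_GONE;            // still unreachable, but not permanent
                p.fetchIntervalDays = Math.min(
                    (int) (Math.max(1, p.fetchIntervalDays) * BACKOFF),
                    MAX_INTERVAL_DAYS);
            }
        }
    }

With a cap on the interval, a truly dead page costs us only a rare re-check, while a page that comes back is picked up again automatically.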

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


--
Matt Kangas / [EMAIL PROTECTED]

