+1

I see these issues on a daily basis with my directed crawl. Sometimes it's due to a DNS hiccup, sometimes a route is down, sometimes it's me upgrading our router right when the crawl is scheduled to run.

The most frustrating cases are pages that fetch perfectly every week for months on end and are then suddenly set to STATUS_DB_GONE because of a single transient failure. I find myself wishing that the webdb had some memory of past success, but I realize that this would be fairly expensive to implement in general.

I work around this at present by periodically resetting the crawl time on pages with status STATUS_DB_GONE, but that is a hideous hack. Andrzej's suggestion seems like an elegant solution to this problem.
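
For concreteness, the hack amounts to roughly the sketch below. This is illustrative Java only, not actual Nutch code: PageRecord, its fields, and the constant value are stand-ins for the real CrawlDb entries and CrawlDatum constants.

    import java.util.List;

    public class ResetGonePages {
        static final byte STATUS_DB_GONE = 3;     // illustrative constant, not Nutch's value

        static class PageRecord {                 // stand-in for a CrawlDb entry
            String url;
            byte status;
            long nextFetchTime;                   // millis since epoch
        }

        // Make GONE pages eligible for the next generate/fetch cycle again.
        static void resetCrawlTime(List<PageRecord> db, long now) {
            for (PageRecord p : db) {
                if (p.status == STATUS_DB_GONE) {
                    p.nextFetchTime = now;        // due immediately
                }
            }
        }
    }

The ugly part is that this treats every GONE page as worth retrying right away, regardless of how long it has actually been unreachable.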

--matt

On Feb 23, 2006, at 8:23 AM, Andrzej Bialecki wrote:

Hi,

I have some doubts about how the current code interprets this state.

Currently, if a page in CrawlDB ends up in STATUS_DB_GONE, it will stay in that state forever. But it's easy to imagine a scenario where a page is inaccessible for a while (e.g. the server is down for an extended period and then restored, or someone mistakenly linked to a /media/year2006/April), but after a while it becomes available again.

As it is now, the Generator will always skip pages with this status, so they have no chance of ever being revisited. I propose never treating such pages as truly gone, and instead significantly increasing their re-fetch interval. That way we will eventually check whether the page is back again: if not, we increase the interval again; if it is, we have a chance to reset the status to DB_FETCHED.
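
Roughly, the policy I have in mind looks like the sketch below. This is illustrative Java only, not the real CrawlDatum API; the class, constants, growth factor and interval numbers are just placeholders.

    public class GonePagePolicy {
        static final byte STATUS_DB_FETCHED = 2;      // illustrative constants
        static final byte STATUS_DB_GONE = 3;

        static final float BACKOFF = 2.0f;            // growth factor (an assumption)
        static final int DEFAULT_INTERVAL_DAYS = 30;  // normal re-fetch interval
        static final int MAX_INTERVAL_DAYS = 365;     // cap so the interval stays bounded

        static class PageRecord {                     // stand-in for a CrawlDb entry
            byte status;
            int fetchIntervalDays;
        }

        // Apply one fetch outcome: back off further on failure, recover on success.
        static void update(PageRecord p, boolean fetchSucceeded) {
            if (fetchSucceeded) {
                p.status = STATUS_DB_FETCHED;         // the page is back
                p.fetchIntervalDays = DEFAULT_INTERVAL_DAYS;
            } else {
                p.status = STATUS_DB_GONE;            // still unreachable, but not permanent
                p.fetchIntervalDays = Math.min(
                    (int) (Math.max(1, p.fetchIntervalDays) * BACKOFF),
                    MAX_INTERVAL_DAYS);
            }
        }
    }

With a cap on the interval, a truly dead page costs us only a rare re-check, while a page that comes back is picked up again automatically.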

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


--
Matt Kangas / [EMAIL PROTECTED]

