+1
I see these issues on a daily basis with my directed crawl. Sometimes
it's due to a DNS hiccup, sometimes a route is down, sometimes it's
me upgrading our router when the crawl is scheduled.
What's most frustrating is pages that fetch perfectly every week
for months on end and then are suddenly set to STATUS_DB_GONE due
to a transient failure. I find myself wishing that the webdb had
some memory of past success, but I realize that this would be
pretty expensive to implement in general.
I work around this at present by periodically resetting the crawl-
time on pages with status STATUS_DB_GONE, but that is a hideous hack.
Andrzej's suggestion seems like an elegant solution to this problem.
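(For reference, a minimal sketch of that kind of reset, assuming a
CrawlDatum-like record kept in memory; the class, field, and status
constants below are illustrative stand-ins, not Nutch's actual API.)

    import java.util.Map;

    public class ResetGoneEntries {

        static final byte STATUS_DB_GONE = 3;      // assumed code, not Nutch's real value
        static final byte STATUS_DB_UNFETCHED = 1; // assumed code

        /** Hypothetical stand-in for Nutch's CrawlDatum. */
        static class CrawlEntry {
            byte status;
            long nextFetchTime; // epoch millis when the page is next due
        }

        /** Make every GONE entry due for fetching again right away. */
        static void resetGone(Map<String, CrawlEntry> crawlDb, long now) {
            for (CrawlEntry e : crawlDb.values()) {
                if (e.status == STATUS_DB_GONE) {
                    e.status = STATUS_DB_UNFETCHED; // let the Generator pick it up
                    e.nextFetchTime = now;          // due immediately
                }
            }
        }
    }

Run periodically before a generate cycle, this makes GONE pages
eligible again, which is also why it feels like a hack: it keeps
re-trying pages that really are gone, with no back-off at all.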
--matt
On Feb 23, 2006, at 8:23 AM, Andrzej Bialecki wrote:
Hi,
I have some concerns about how the current code interprets this
state. Currently, if a page in CrawlDB ends up in STATUS_DB_GONE,
it stays in that state forever. But it's easy to imagine a
scenario where a page is inaccessible for a while (e.g. the server
is down for an extended period and is then restored, or someone
linked to a /media/year2006/April by mistake) and then, after a
while, the page becomes available again.
As it is now, the Generator will always skip pages with this
status, so they have no chance of ever being revisited. I propose
never treating such pages as truly gone, but instead significantly
increasing their re-fetch interval. This way we will eventually be
able to check whether the page is back: if not, we increase the
interval again; if it is, we have a chance to reset its status to
DB_FETCHED.
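A rough sketch of that policy, again assuming a CrawlDatum-like
record; the names, status codes, and the doubling-with-a-cap
back-off below are illustrative, not the actual Nutch API:

    public class GoneBackoffPolicy {

        static final long DAY = 24L * 60 * 60 * 1000;
        static final long MAX_INTERVAL = 365 * DAY;   // assumed cap on back-off

        static final byte STATUS_DB_FETCHED = 2;      // assumed status codes
        static final byte STATUS_DB_GONE = 3;

        /** Hypothetical stand-in for Nutch's CrawlDatum. */
        static class CrawlEntry {
            byte status;
            long fetchInterval;   // millis between fetch attempts
            long nextFetchTime;   // epoch millis of the next attempt
        }

        /** Fold the outcome of a fetch attempt back into the DB entry. */
        static void update(CrawlEntry e, boolean fetchSucceeded, long now) {
            if (fetchSucceeded) {
                // The page came back: restore it to the normal fetched state.
                e.status = STATUS_DB_FETCHED;
            } else {
                // Never mark it permanently gone; just back off the interval.
                e.status = STATUS_DB_GONE;
                e.fetchInterval = Math.min(e.fetchInterval * 2, MAX_INTERVAL);
            }
            // Either way the page stays eligible for a future fetch.
            e.nextFetchTime = now + e.fetchInterval;
        }
    }

Doubling with a cap keeps a long-dead page from being retried
often, while still guaranteeing it is revisited eventually.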
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
--
Matt Kangas / [EMAIL PROTECTED]