Otis Gospodnetic wrote:
I don't know of an elegant way, but if you want to hack the Nutch
sources, you could set its refetch time to some point in time
veeerrrry far in the future, for example. Or introduce an additional
status.

This won't work, because the pages will be checked again after a maximum.fetch.interval.

Pages that return ACCESS_DENIED may do so only for some time, so Nutch needs to check their status periodically. In a sense, no page is ever truly GONE, if only because we need some way to represent the nonexistent targets of stale links - if we removed these URLs from the db, they would soon be rediscovered and added again.
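
To make that concrete, here is a minimal sketch in Java - note this is not Nutch's real CrawlDatum/FetchSchedule API; the names (CrawlEntry, recordFetch, the 7-day recheck interval) are purely illustrative. The point is simply that a GONE or ACCESS_DENIED result only updates the entry's status and next check time; there is no path that removes the URL from the db.

import java.util.concurrent.TimeUnit;

public class StatusRecheckSketch {

    enum Status { FETCHED, GONE, ACCESS_DENIED, ROBOTS_DENIED }

    static class CrawlEntry {
        String url;
        Status status;
        long nextCheckTime;   // epoch millis; note there is no "delete" path at all
    }

    /** Record the outcome of a fetch and schedule the next status check. */
    static void recordFetch(CrawlEntry entry, Status result, long now, long recheckMs) {
        entry.status = result;
        // Even GONE / *_DENIED entries get a future check time instead of removal.
        entry.nextCheckTime = now + recheckMs;
    }

    public static void main(String[] args) {
        CrawlEntry e = new CrawlEntry();
        e.url = "http://www.example.com/protected.html";
        recordFetch(e, Status.ACCESS_DENIED, System.currentTimeMillis(),
                TimeUnit.DAYS.toMillis(7));
        System.out.println(e.url + " -> " + e.status + ", recheck at " + e.nextCheckTime);
    }
}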

The gory details of maximum.fetch.interval follow... Nutch periodically checks the status of all pages in the CrawlDb, no matter what their state, including GONE, ACCESS_DENIED, ROBOTS_DENIED, etc. If you use an adaptive re-fetch strategy (AdaptiveFetchSchedule), the re-fetch interval reaches the maximum value within a few cycles, so these checks won't occur too often.

You may be tempted to set this maximum to infinity, i.e. to never check such URLs again. However, the whole point of a finite maximum refetch interval is to be able to phase out old segments: after N days you can safely delete an old segment, because by then all of its pages are guaranteed to have been scheduled for refetching and will be found in a newer segment.
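
As a rough illustration of the adaptive part, here is a hedged sketch in Java. It is not Nutch's actual AdaptiveFetchSchedule code - INC_RATE and the 30-day cap are made-up values standing in for the real configuration - but it shows how the interval widens toward the cap and then sticks there:

import java.util.concurrent.TimeUnit;

public class AdaptiveIntervalSketch {

    // Illustrative values only - the real knobs live in Nutch's configuration.
    static final double INC_RATE = 0.4;   // widen the interval by 40% per unchanged fetch
    static final long MAX_INTERVAL_MS = TimeUnit.DAYS.toMillis(30);

    /** Widen the re-fetch interval after an unchanged (or denied) fetch, clamped at the maximum. */
    static long nextInterval(long currentIntervalMs) {
        long widened = (long) (currentIntervalMs * (1.0 + INC_RATE));
        return Math.min(widened, MAX_INTERVAL_MS);
    }

    public static void main(String[] args) {
        long interval = TimeUnit.DAYS.toMillis(1);
        for (int cycle = 1; cycle <= 12; cycle++) {
            interval = nextInterval(interval);
            System.out.printf("cycle %2d: ~%d days%n",
                    cycle, TimeUnit.MILLISECONDS.toDays(interval));
        }
        // The interval hits the 30-day cap after about 10 cycles and stays there,
        // so every page is guaranteed to be rescheduled within that window -
        // which is what makes it safe to delete segments older than the cap.
    }
}

In other words, the cap is what turns "delete segments older than N days" into a safe rule rather than a gamble.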

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
