Otis Gospodnetic wrote:
I don't know of an elegant way, but if you want to hack the Nutch
sources, you could set its refetch time to some point in time
veeerrrry far in the future, for example. Or introduce an additional
status.

This won't work, because the pages will be checked again after a maximum.fetch.interval.

Pages that return ACCESS_DENIED may do so only for some time, so Nutch needs to check their status periodically. In a sense, no page is ever truly GONE, if only because we need some way to represent the nonexistent targets of stale links - if we removed these URLs from the db, they would soon be rediscovered and added again.
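
To make that concrete, here is a minimal sketch in Java - note this is not Nutch's real CrawlDatum/FetchSchedule API; the names (CrawlEntry, recordFetch, the 7-day recheck interval) are purely illustrative. The point is simply that a GONE or ACCESS_DENIED result only updates the entry's status and next check time; there is no path that removes the URL from the db.

import java.util.concurrent.TimeUnit;

public class StatusRecheckSketch {

    enum Status { FETCHED, GONE, ACCESS_DENIED, ROBOTS_DENIED }

    static class CrawlEntry {
        String url;
        Status status;
        long nextCheckTime;   // epoch millis; note there is no "delete" path at all
    }

    /** Record the outcome of a fetch and schedule the next status check. */
    static void recordFetch(CrawlEntry entry, Status result, long now, long recheckMs) {
        entry.status = result;
        // Even GONE / *_DENIED entries get a future check time instead of removal.
        entry.nextCheckTime = now + recheckMs;
    }

    public static void main(String[] args) {
        CrawlEntry e = new CrawlEntry();
        e.url = "http://www.example.com/protected.html";
        recordFetch(e, Status.ACCESS_DENIED, System.currentTimeMillis(),
                TimeUnit.DAYS.toMillis(7));
        System.out.println(e.url + " -> " + e.status + ", recheck at " + e.nextCheckTime);
    }
}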

The gory details of maximum.fetch.interval follow... Nutch periodically checks the status of all pages in the CrawlDb, no matter what their state, including GONE, ACCESS_DENIED, ROBOTS_DENIED, etc. If you use an adaptive re-fetch strategy (AdaptiveFetchSchedule), the re-fetch interval reaches the maximum value within a few cycles, so these checks won't occur too often.

You may be tempted to set this maximum to infinity, i.e. to never check such URLs again. However, the whole point of a finite maximum refetch interval is to be able to phase out old segments: after N days you can safely delete an old segment, because by then all of its pages are guaranteed to have been scheduled for refetching and will be found in a newer segment.
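
As a rough illustration of the adaptive part, here is a hedged sketch in Java. It is not Nutch's actual AdaptiveFetchSchedule code - INC_RATE and the 30-day cap are made-up values standing in for the real configuration - but it shows how the interval widens toward the cap and then sticks there:

import java.util.concurrent.TimeUnit;

public class AdaptiveIntervalSketch {

    // Illustrative values only - the real knobs live in Nutch's configuration.
    static final double INC_RATE = 0.4;   // widen the interval by 40% per unchanged fetch
    static final long MAX_INTERVAL_MS = TimeUnit.DAYS.toMillis(30);

    /** Widen the re-fetch interval after an unchanged (or denied) fetch, clamped at the maximum. */
    static long nextInterval(long currentIntervalMs) {
        long widened = (long) (currentIntervalMs * (1.0 + INC_RATE));
        return Math.min(widened, MAX_INTERVAL_MS);
    }

    public static void main(String[] args) {
        long interval = TimeUnit.DAYS.toMillis(1);
        for (int cycle = 1; cycle <= 12; cycle++) {
            interval = nextInterval(interval);
            System.out.printf("cycle %2d: ~%d days%n",
                    cycle, TimeUnit.MILLISECONDS.toDays(interval));
        }
        // The interval hits the 30-day cap after about 10 cycles and stays there,
        // so every page is guaranteed to be rescheduled within that window -
        // which is what makes it safe to delete segments older than the cap.
    }
}

In other words, the cap is what turns "delete segments older than N days" into a safe rule rather than a gamble.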

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
