chris sleeman wrote:
Hi,

Can someone please explain how the fetcher behaves with respect to
modified/unmodified content, in the current trunk version?

My requirement is basically this - I
have one page (seed url) which has links to other urls. The links on
this page change on a daily basis.
I want nutch to keep refetching this page, as it changes regularly,
but not refetch the outlinks on this page, since they are more or less
static.

Nutch will behave differently, depending on which fetch schedule you're using. With the DefaultFetchSchedule, the refetch period is fixed and doesn't change, no matter whether a page is modified or not. With AdaptiveFetchSchedule, Nutch will adjust the refetch interval to match the expected period of changes.
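
To make the difference between the two schedules concrete, here is a minimal sketch in Python. It is a simplified model, not Nutch's actual code: the function names are hypothetical, and the increase/decrease rates and the min/max bounds are illustrative assumptions rather than values taken from this thread.

```python
DAY = 24 * 60 * 60  # seconds in a day


def fixed_interval(interval, modified):
    """Fixed schedule: the interval is returned unchanged, modified or not."""
    return interval


def adaptive_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                      min_interval=60, max_interval=365 * DAY):
    """Adaptive schedule (simplified): shrink the interval after a change,
    grow it when the page was unmodified, then clamp to the bounds.
    The rates and bounds here are assumed values for illustration."""
    if modified:
        interval *= (1.0 - dec_rate)
    else:
        interval *= (1.0 + inc_rate)
    return max(min_interval, min(max_interval, interval))
```

With a model like this, a page that changes on every visit drifts toward the minimum interval, while a static page drifts toward the maximum.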

In any case, if a page is not modified, Nutch will try to avoid fetching it again (using If-Modified-Since headers).
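
As an illustration of what a conditional request with If-Modified-Since looks like on the wire, here is a small sketch using Python's standard library. It shows the HTTP mechanism only and is not how Nutch's protocol plugins are written; the helper names are hypothetical.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_fetch_epoch):
    """Build a GET request carrying the time of the previous fetch."""
    request = urllib.request.Request(url)
    # HTTP dates must be expressed in GMT, hence usegmt=True.
    request.add_header("If-Modified-Since",
                       formatdate(last_fetch_epoch, usegmt=True))
    return request


def fetch_if_modified(url, last_fetch_epoch):
    """Return the page body, or None if the server answers 304 Not Modified."""
    request = build_conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return None  # unchanged since last fetch, nothing to download
        raise
```

Note that this only saves the download if the server honours the header; a server that ignores it will simply return the full page again.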


I have set both "db.fetch.interval.default" and "db.fetch.interval.max" to a
high value of approx. 1 year and am using the DefaultFetchSchedule
class. Does this imply that even for pages which have been modified,
the next fetch would be after a year?

Correct. Nutch doesn't know that a page has changed unless it actually tries to fetch it. Since you're using the DefaultFetchSchedule, and the fetch interval is 1 year, Nutch will check the page once a year, and it will never adjust the interval, regardless of the page's status.

However, this is not strictly true in general. Even if you set a very high value for a page's fetch interval, there is a hard limit (db.fetch.interval.max), and pages older than this limit will be scheduled for refetching, no matter what their fetch interval is.
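
The effect of the hard limit can be modelled roughly as follows. This is a sketch under the assumption that the limit simply caps the effective interval; the function names are hypothetical and Nutch's real scheduling code is structured differently.

```python
DAY = 24 * 60 * 60  # seconds in a day


def next_fetch_time(last_fetch, interval, max_interval):
    """The page's own interval applies, but never beyond the hard limit."""
    return last_fetch + min(interval, max_interval)


def is_due(now, last_fetch, interval, max_interval):
    """True once the (capped) interval has elapsed since the last fetch."""
    return now >= next_fetch_time(last_fetch, interval, max_interval)
```

For example, a page with a 365-day interval and a 90-day hard limit would be due again after 90 days in this model, not after a year.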

In your case, I would recommend setting a very short interval for the main page, and setting longer (default) intervals for other pages. Additionally, you can use AdaptiveFetchSchedule to adjust these intervals.
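
The recommendation above amounts to keeping a per-page interval, with a short one for the seed page and the default for everything else. A rough sketch of that idea, with hypothetical URLs and a hypothetical `due_urls` helper (how the per-page interval is actually assigned in Nutch is not covered here):

```python
DAY = 24 * 60 * 60  # seconds in a day


def due_urls(now, pages, default_interval):
    """pages maps url -> (last_fetch, interval); an interval of None
    means the page falls back to the default interval."""
    due = []
    for url, (last_fetch, interval) in sorted(pages.items()):
        effective = default_interval if interval is None else interval
        if now - last_fetch >= effective:
            due.append(url)
    return due


# Hypothetical data: the seed page is checked daily, its outlink uses the default.
pages = {
    "http://example.com/": (0, 1 * DAY),
    "http://example.com/article": (0, None),
}
```

Two days after the initial fetch, with a 30-day default, only the seed page would be selected for refetching in this model.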

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
