Hi Andrzej,

Thanks for your response. However, I still have a couple of doubts.

> In your case, I would recommend setting a very short interval for the
> main page, and setting longer (default) intervals for other pages.

Isn't the fetch interval a system-wide setting? Or can we set it for
individual URLs? What I basically need is a different fetch interval for
the injected (seed) URLs than for the other URLs. Since this may not be
available out of the box, I was thinking of modifying the injector code
to use a very different fetch interval for the seed URLs. Would such an
approach work? And will the fetch interval, once set for a URL, be used
for all later fetches of that URL?

Thanks and regards,
Chris

On 10/13/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> chris sleeman wrote:
> > Hi,
> >
> > Can someone please explain how the fetcher behaves with respect to
> > modified/unmodified content, in the current trunk version?
> >
> > My requirement is basically this - I have one page (seed url) which
> > has links to other urls. The links in this page keep getting changed
> > on a daily basis. I want nutch to keep refetching this page, as it
> > changes regularly, but not refetch the outlinks on this page since
> > they are more or less static.
>
> Nutch will behave differently, depending on which fetch schedule you're
> using. With the DefaultFetchSchedule, the refetch period is fixed and
> doesn't change, no matter whether a page was modified or not. With
> AdaptiveFetchSchedule, Nutch will adjust the refetch interval to match
> the expected period of changes.
>
> In any case, if a page is not modified, Nutch will try to avoid fetching
> it again (using If-Modified-Since headers).
>
> > I have set both "db.fetch.interval.default" and
> > "db.fetch.interval.max" to a high value of approx. 1 year and am
> > using the DefaultFetchSchedule class. Does this imply that even for
> > pages which have been modified, the next fetch would be after a year?
>
> Correct. Nutch doesn't know that a page has changed unless it actually
> tries to fetch it. Since you're using the DefaultFetchSchedule, and the
> fetch interval is 1 year, Nutch will check the page at 1-year intervals,
> and it will never adjust the interval, no matter what the status of the
> page is.
>
> However, this is not strictly true. Even if you set a very high value
> for this interval, there is a hard limit (db.fetch.interval.max), and
> pages older than this limit will be scheduled for refetching, no matter
> what their fetch interval is.
>
> In your case, I would recommend setting a very short interval for the
> main page, and setting longer (default) intervals for other pages.
> Additionally, you can use AdaptiveFetchSchedule to adjust these
> intervals.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
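
To make the scheduling rule discussed in this thread concrete, here is a
minimal, self-contained sketch. It is not Nutch code: the class
FetchScheduleSketch, its methods, the example URLs, and the interval
values are all hypothetical. It assumes that each URL carries its own
fetch interval, that a DefaultFetchSchedule-like policy never changes
that interval, and that a hard maximum (in the spirit of
db.fetch.interval.max) forces a refetch regardless of the per-URL value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a fixed per-URL fetch interval with a hard upper limit. */
public class FetchScheduleSketch {

    /** Per-URL state: the interval set at injection and the last fetch time. */
    private static final class Entry {
        final long intervalSec;
        long lastFetchSec;

        Entry(long intervalSec, long lastFetchSec) {
            this.intervalSec = intervalSec;
            this.lastFetchSec = lastFetchSec;
        }
    }

    private final long maxIntervalSec;
    private final Map<String, Entry> db = new LinkedHashMap<>();

    public FetchScheduleSketch(long maxIntervalSec) {
        this.maxIntervalSec = maxIntervalSec;
    }

    /** Simplification: a URL counts as fetched at the moment it is added. */
    public void inject(String url, long intervalSec, long nowSec) {
        db.put(url, new Entry(intervalSec, nowSec));
    }

    /** Due once its age reaches its own interval or the hard limit, whichever is smaller. */
    public boolean isDue(String url, long nowSec) {
        Entry e = db.get(url);
        if (e == null) {
            return false;
        }
        long age = nowSec - e.lastFetchSec;
        return age >= Math.min(e.intervalSec, maxIntervalSec);
    }

    /** Records a fetch; the interval itself is never adjusted (fixed schedule). */
    public void markFetched(String url, long nowSec) {
        Entry e = db.get(url);
        if (e != null) {
            e.lastFetchSec = nowSec;
        }
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60;
        FetchScheduleSketch s = new FetchScheduleSketch(90 * day);
        s.inject("http://example.com/", day, 0L);               // seed page: short interval
        s.inject("http://example.com/article", 365 * day, 0L);  // outlink: long interval

        System.out.println(s.isDue("http://example.com/", 2 * day));          // true
        System.out.println(s.isDue("http://example.com/article", 2 * day));   // false
        System.out.println(s.isDue("http://example.com/article", 100 * day)); // true (hard limit)
    }
}
```

In this model, giving the seed URL a short interval at injection time is
enough to have it rechecked often while the outlinks wait much longer,
and the value set once per URL is reused after every fetch. Whether the
injector in the trunk can be changed to behave this way is exactly the
open question above.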
