Can you please suggest how to go about implementing this? I would like
to add this check.


On Thu, 2005-04-21 at 13:14, Jérôme Charron wrote:
> >
> > What I mean by expired pages is those pages whose Last-Modified date has
> > changed since the last fetch.
> > Whole-web crawling fetches all pages that are due to be fetched (e.g.,
> > every 30 days). These pages may not have actually changed in content. I
> > would like to know if there is any way to tell Nutch to compare the
> > Last-Modified date and fetch the page only if the date is different from
> > what is stored in the index. I think this way we can save time by fetching
> > and indexing only the modified pages while re-crawling the same site after
> > some time.
>
>
> A long time ago I suggested using the HEAD method, or a GET with the
> If-Modified-Since header (as suggested by Otis), in order to fetch only
> changed documents.
> The discussion is here:
> http://www.mail-archive.com/[email protected]/msg00091.html
> But so far I have not found time to implement this feature...
>
> Jerome
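
For reference, here is a minimal sketch of the If-Modified-Since approach
Jerome describes. It is written in Python rather than Nutch's Java, and it
assumes the crawler has stored the time of the last successful fetch for
each URL. The function names are hypothetical and are not part of Nutch.

```python
from datetime import datetime, timezone
from email.utils import formatdate
import urllib.error
import urllib.request


def build_conditional_request(url, last_fetch):
    """Build a GET request that carries If-Modified-Since.

    last_fetch must be a timezone-aware datetime (an assumption of this
    sketch); HTTP dates are always expressed in GMT.
    """
    stamp = formatdate(last_fetch.timestamp(), usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url, last_fetch):
    """Return the page body, or None if the server answers 304 Not Modified."""
    request = build_conditional_request(url, last_fetch)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; it means the stored copy
        # is still current, so there is nothing to re-fetch or re-index.
        if err.code == 304:
            return None
        raise
```

A re-crawl would call fetch_if_modified(url, stored_fetch_time) and skip
parsing and indexing whenever it returns None. Not every server honours
the header, so a 200 response does not prove that the content changed.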

