Can you please suggest how to go about implementing this? I would like to add this check.
On Thu, 2005-04-21 at 13:14, Jérôme Charron wrote:
> > What I mean by expired pages is those pages whose last Modified date has
> > changed since last fetch.
> > Whole-web crawling fetches all pages that are due to be fetched (e.g.,
> > every 30 days). These pages may not have actually changed in content. I
> > would like to know if there is any way to tell Nutch to compare the last
> > modified date and fetch the page only if the date is different from what
> > is there in the index. I think this way we can save time by fetching and
> > indexing only the modified pages while re-crawling the same site after
> > some time.
>
> I suggested a long time ago using the HEAD method or the GET header
> If-Modified-Since (as suggested by Otis) in order to fetch only changed
> documents.
> The discussion is here:
> http://www.mail-archive.com/[email protected]/msg00091.html
> But actually I don't find time to implement this feature...
>
> Jerome
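
For reference, here is a minimal sketch of the two approaches mentioned in the quoted message: a conditional GET with If-Modified-Since, and a comparison of Last-Modified dates such as a HEAD request would return. It is written in Python only to keep it short, and it is not Nutch code. The function names (conditional_headers, has_changed, fetch_if_modified) are hypothetical, and it assumes the crawler stores either the time of the previous fetch or the Last-Modified value it saw then.

```python
from email.utils import formatdate, parsedate_to_datetime
import urllib.error
import urllib.request


def conditional_headers(last_fetch_epoch):
    """Build the If-Modified-Since header from the previous fetch time (epoch seconds)."""
    # usegmt=True yields the HTTP date format, e.g. "Thu, 21 Apr 2005 13:14:00 GMT".
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}


def has_changed(stored_last_modified, new_last_modified):
    """HEAD variant: compare the stored Last-Modified value with a freshly read one."""
    if not stored_last_modified or not new_last_modified:
        # Assumption: with no date to compare, refetch to be safe.
        return True
    return parsedate_to_datetime(new_last_modified) != parsedate_to_datetime(stored_last_modified)


def fetch_if_modified(url, last_fetch_epoch):
    """Return the page body, or None when the server answers 304 Not Modified."""
    request = urllib.request.Request(url, headers=conditional_headers(last_fetch_epoch))
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Unchanged since the last fetch: skip re-parsing and re-indexing.
            return None
        raise
```

The conditional GET needs only one request per page, while the HEAD variant needs a second request whenever the page did change. Either way, servers that do not send Last-Modified or ignore If-Modified-Since will simply return the full page, so the crawler would still have to handle that case as a normal fetch.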
