What I mean by expired pages is pages whose Last-Modified date has changed since the last fetch. Whole-web crawling fetches all pages that are due to be fetched (e.g., every 30 days), even though those pages may not have actually changed in content. Is there any way to tell Nutch to compare the Last-Modified date and fetch a page only if the date differs from what is in the index? That way we could save time by fetching and indexing only the modified pages when re-crawling the same site later.
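In HTTP terms, what I have in mind is a conditional GET: send the stored timestamp in an If-Modified-Since header and skip the page when the server answers 304 Not Modified. Below is a minimal sketch of that mechanism in plain Java; this is not Nutch's actual fetcher code, just an illustration of the idea:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch only -- not Nutch's fetcher. Returns true if the page
    // appears to have changed since lastFetchTime (ms since the epoch).
    public class ConditionalFetch {

        static boolean changedSince(String pageUrl, long lastFetchTime)
                throws IOException {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(pageUrl).openConnection();
            conn.setRequestMethod("GET");
            // Adds an If-Modified-Since header with the stored timestamp.
            conn.setIfModifiedSince(lastFetchTime);

            // 304 Not Modified: the server says the page is unchanged,
            // so we can skip refetching and reindexing it.
            return conn.getResponseCode()
                    != HttpURLConnection.HTTP_NOT_MODIFIED;
        }

        public static void main(String[] args) throws IOException {
            long thirtyDaysAgo =
                System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
            System.out.println(
                changedSince("http://www.example.com/", thirtyDaysAgo)
                    ? "changed -> refetch and reindex"
                    : "unchanged -> skip");
        }
    }

One caveat: many servers do not honor If-Modified-Since, so a real crawler would probably also want to compare a checksum of the fetched content before deciding whether to reindex.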
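Also, the 30-day expiry used by whole-web crawling should be tunable through the Nutch configuration. Assuming the fetch-interval property name from nutch-default.xml is db.default.fetch.interval (please verify against your version), an override in nutch-site.xml would look like:

    <!-- nutch-site.xml override; property name taken from
         nutch-default.xml -- check that it exists in your version -->
    <property>
      <name>db.default.fetch.interval</name>
      <value>7</value>  <!-- re-fetch pages every 7 days instead of 30 -->
    </property>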
Kannan

On Tue, 2005-04-19 at 21:38, Doug Cutting wrote:
> Kannan Sundaramoorthy wrote:
> > I would like to perform incremental crawling using Nutch. What I want
> > to do is to configure Nutch in such a way that it checks for expired
> > pages and issues new crawls for the expired pages only.
> > Other requirements are:
> >
> > 1. Ability to inject new urls into the crawl database. When
> >    incremental crawling begins, Nutch should crawl the newly
> >    injected urls.
> >
> > 2. After an incremental crawl is completed, either a new search
> >    index should be created or the previous search index should be
> >    updated.
> >
> > Can anyone suggest how to achieve this?
>
> This sounds like the "Whole-web Crawling" as described in the tutorial:
>
> http://incubator.apache.org/nutch/tutorial.html#Whole-web+Crawling
>
> By default this method will expire and recrawl urls every 30 days.
>
> Doug
