What I mean by expired pages is pages whose Last-Modified date has
changed since the last fetch.
Whole-web crawling fetches all pages that are due to be fetched (e.g.,
every 30 days), even if their content has not actually changed. I would
like to know whether there is a way to tell Nutch to compare the
Last-Modified date and fetch a page only if the date differs from the
one recorded in the index. That way, when re-crawling the same site
later, we would save time by fetching and indexing only the modified
pages.


Kannan
On Tue, 2005-04-19 at 21:38, Doug Cutting wrote:
> Kannan Sundaramoorthy wrote:
> > I would like to perform incremental crawling using Nutch. What I want
> > to do is to configure Nutch so that it checks for expired pages and
> > issues new crawls for only the expired pages. 
> > Other requirements are:
> >      1. Ability to inject new urls into the crawl database. When
> >         incremental crawling begins, Nutch should crawl the newly
> >         injected urls. 
> >         
> >      2. After an incremental crawl is completed, either a new search
> >         index should be created or the previous search index should be
> >         updated. 
> > 
> > Can anyone suggest how to achieve this?
> 
> This sounds like the "Whole-web Crawling" as described in the tutorial:
> 
> http://incubator.apache.org/nutch/tutorial.html#Whole-web+Crawling
> 
> By default this method will expire and recrawl urls every 30 days.
> 
> Doug
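
For reference, one pass of the whole-web crawl cycle from the tutorial
boils down to something like the following (Nutch 0.6-era commands; the
exact flags are from memory, so please verify against the tutorial
linked above):

    bin/nutch inject db -urlfile urls   # add newly injected urls to the webdb
    bin/nutch generate db segments      # select pages due to be (re)fetched
    s=`ls -d segments/2* | tail -1`     # pick the newly generated segment
    bin/nutch fetch $s                  # fetch the selected pages
    bin/nutch updatedb db $s            # fold fetch results into the webdb
    bin/nutch index $s                  # index the new segment

Repeating this cycle covers both requirements: newly injected urls are
picked up by the generate step, and each pass produces a freshly
indexed segment.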

