Kannan Sundaramoorthy wrote:
I would like to perform incremental crawling with Nutch. I want to
configure Nutch so that it checks for expired pages and issues new
crawls for those expired pages only. Other requirements are:
1. Ability to inject new URLs into the crawl database. When
   incremental crawling begins, Nutch should crawl the newly
   injected URLs.
2. After an incremental crawl completes, either a new search index
   should be created or the previous search index should be updated.


Can anyone suggest how to achieve this?

This sounds like the "Whole-web Crawling" as described in the tutorial:

http://incubator.apache.org/nutch/tutorial.html#Whole-web+Crawling
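The whole-web crawling cycle from that tutorial could be sketched roughly as
the following commands (a sketch only, based on the Nutch 0.x/1.x command-line
tools; the directory names "urls" and "crawl" are assumptions, not from this
thread, and exact command signatures vary between releases):

```shell
# Inject any newly added seed URLs into the crawl database.
bin/nutch inject crawl/crawldb urls

# Generate a fetch list of due pages (new URLs plus expired ones).
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=$(ls -d crawl/segments/* | tail -1)

# Fetch the pages, then fold the results back into the crawldb.
bin/nutch fetch "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"

# Rebuild the link database and index the fetched segment.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb "$SEGMENT"
```

Re-running this loop gives the incremental behaviour asked about: generate
only emits URLs whose fetch interval has elapsed, plus anything newly
injected, and the index step picks up the freshly fetched segment.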

By default this method will expire and re-crawl URLs every 30 days.
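If the 30-day default is too long, it can be overridden in
conf/nutch-site.xml. As a sketch (the property name depends on the release:
older versions use db.default.fetch.interval in days, later ones use
db.fetch.interval.default in seconds; the 7-day value is an example only):

```xml
<!-- conf/nutch-site.xml: shorten the recrawl interval (example value).
     Older releases: db.default.fetch.interval, value in days. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>7</value>
</property>
```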

Doug
