Kannan Sundaramoorthy wrote:
I would like to perform incremental crawling with Nutch. I want to
configure Nutch so that it checks for expired pages and issues new
crawls for those expired pages only. Other requirements are:
1. Ability to inject new URLs into the crawl database. When
   incremental crawling begins, Nutch should crawl the newly
   injected URLs.
2. After an incremental crawl completes, either a new search index
   should be created or the previous search index should be updated.


Can anyone suggest how to achieve this?

This sounds like the "Whole-web Crawling" as described in the tutorial:

http://incubator.apache.org/nutch/tutorial.html#Whole-web+Crawling
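The whole-web crawling cycle from that tutorial could be sketched roughly as
the following commands (a sketch only, based on the Nutch 0.x/1.x command-line
tools; the directory names "urls" and "crawl" are assumptions, not from this
thread, and exact command signatures vary between releases):

```shell
# Inject any newly added seed URLs into the crawl database.
bin/nutch inject crawl/crawldb urls

# Generate a fetch list of due pages (new URLs plus expired ones).
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=$(ls -d crawl/segments/* | tail -1)

# Fetch the pages, then fold the results back into the crawldb.
bin/nutch fetch "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"

# Rebuild the link database and index the fetched segment.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb "$SEGMENT"
```

Re-running this loop gives the incremental behaviour asked about: generate
only emits URLs whose fetch interval has elapsed, plus anything newly
injected, and the index step picks up the freshly fetched segment.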

By default this method will expire and re-crawl URLs every 30 days.
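If the 30-day default is too long, it can be overridden in
conf/nutch-site.xml. As a sketch (the property name depends on the release:
older versions use db.default.fetch.interval in days, later ones use
db.fetch.interval.default in seconds; the 7-day value is an example only):

```xml
<!-- conf/nutch-site.xml: shorten the recrawl interval (example value).
     Older releases: db.default.fetch.interval, value in days. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>7</value>
</property>
```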

Doug
