I have configured a Nutch instance to continuously crawl a particular site. I have successfully managed to fetch the website data in bulk and to post-process that information.
The problem I'm facing now is that every time I run the crawl I have to post-process the whole site. I want to optimize the post-processing, and to do so I need to get from Nutch the list of pages that have changed since the last crawl was run. What's the best way to do that? Is there already a mechanism in Nutch that keeps track of the last time the content of a page changed? Or do I have to create a registry of crawled pages, with an MD5 hash for example, and keep track of changes myself?

Aside from tracking which pages have changed, I also need to track which pages have been removed since the last crawl. Is there a specific mechanism for tracking removed pages (i.e. 404, 301, 302 HTTP codes)?

Any tips, ideas or shared experiences will be more than welcome, and I will gladly share the lessons learnt once I have the thing running.

Hugs,
Julio
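P.S. In case it helps frame the question: from a first read of the CrawlDb code, it looks like Nutch 1.x already stores a per-URL content signature in each CrawlDatum (db.signature.class, MD5Signature by default), and "bin/nutch readdb <crawldb> -dump <dir>" will dump it. If there is nothing more direct, the bookkeeping I have in mind looks roughly like the sketch below. Caveats: it assumes the 1.x on-disk layout (map files of <Text url, CrawlDatum> under current/part-*/data), and the class name and the tab-separated registry file are placeholders of mine, not anything from Nutch.

import java.io.*;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.StringUtil;

// Sketch: compare each page's stored signature in the CrawlDb against a
// registry saved by the previous run, then rewrite the registry.
public class ChangedPages {
  public static void main(String[] args) throws IOException {
    String crawlDb = args[0];          // e.g. "crawl/crawldb"
    File registry = new File(args[1]); // e.g. "signatures.prev" (my own format)

    // Load the url -> signature registry written by the previous run, if any.
    Map<String, String> previous = new HashMap<String, String>();
    if (registry.exists()) {
      BufferedReader in = new BufferedReader(new FileReader(registry));
      String line;
      while ((line = in.readLine()) != null) {
        int tab = line.indexOf('\t');
        previous.put(line.substring(0, tab), line.substring(tab + 1));
      }
      in.close();
    }

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    PrintWriter out = new PrintWriter(new FileWriter(registry));

    // CrawlDb parts are map files; reading each part's "data" file as a
    // SequenceFile of <Text url, CrawlDatum> works in Nutch 1.x.
    for (FileStatus part : fs.globStatus(new Path(crawlDb, "current/part-*/data"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        byte[] sig = datum.getSignature();
        if (sig == null) continue;                // never fetched successfully
        String hex = StringUtil.toHexString(sig); // content signature as hex
        String old = previous.get(url.toString());
        if (old == null) {
          System.out.println("NEW\t" + url);
        } else if (!old.equals(hex)) {
          System.out.println("CHANGED\t" + url);
        }
        out.println(url + "\t" + hex);            // record for the next run
      }
      reader.close();
    }
    out.close();
  }
}

The idea would be that each run only reports NEW/CHANGED URLs to the post-processor instead of the whole site.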
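For the removals, the CrawlDatum status byte seems to already distinguish the cases I care about (STATUS_DB_GONE for pages that came back 404, STATUS_DB_REDIR_PERM / STATUS_DB_REDIR_TEMP for 301/302), so a second pass in the same style could emit a deletion list. Again, just a sketch over the assumed 1.x layout:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Sketch: list URLs whose CrawlDatum status marks them as gone or
// redirected, so the post-processor can drop them.
public class GonePages {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus part : fs.globStatus(new Path(args[0], "current/part-*/data"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        switch (datum.getStatus()) {
          case CrawlDatum.STATUS_DB_GONE:        // e.g. pages returning 404
            System.out.println("GONE\t" + url);
            break;
          case CrawlDatum.STATUS_DB_REDIR_PERM:  // 301
          case CrawlDatum.STATUS_DB_REDIR_TEMP:  // 302
            System.out.println("REDIRECT\t" + url);
            break;
        }
      }
      reader.close();
    }
  }
}

If there is already a built-in way to get the same lists out of Nutch, I'd much rather use that than maintain this myself.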

