I have configured a *Nutch* instance to continuously crawl a particular site. I have successfully managed to fetch the site's data in bulk and, based on that, post-process the information.
The problem I'm facing now is that every time I run the crawling process I have to post-process the whole site. I want to optimize the post-processing, and to do so I need to get from *Nutch* the list of pages that have changed since the last crawl was run. What's the best way to do that? Is there already a mechanism in *Nutch* that keeps track of the last time the content of a page changed? Or do I have to create a registry of crawled pages, with an MD5 digest for example, and keep track of changes myself?

Aside from tracking which pages have changed, I also need to track which pages have been removed since the last crawl. Is there a specific mechanism to track removed pages (i.e. 404, 301, 302 HTTP status codes)?

Any tips, ideas, or shared experiences will be more than welcome, and I will gladly share the lessons learned once I have the thing running.

Hugs,
Julio

--
XNG | Julio Garcés Teuber
Email: [email protected]
Skype: julio.xng
Tel: +54 (11) 4777.9488
Fax: +1 (320) 514.4271
http://www.xinergia.com/
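P.S. To frame the question a bit better, here is the rough do-it-myself fallback I have in mind. It leans on two things Nutch already provides: a per-page content signature stored in the CrawlDb (MD5Signature or TextProfileSignature, selected via the db.signature.class property) and the `bin/nutch readdb <crawldb> -dump <dir>` command, which writes the CrawlDb out as text. The script below just diffs two such dumps; the record layout it assumes ("url<TAB>Version: 7" followed by one "Key: value" field per line) matches what I've seen from Nutch 1.x, but it may differ across versions, so treat the parsing as a sketch to adapt:

```python
#!/usr/bin/env python3
"""Sketch: diff two Nutch CrawlDb text dumps to flag changed and removed pages.

Dumps are assumed to come from:
    bin/nutch readdb crawl/crawldb -dump <out_dir>
"""
import sys

def parse_dump(path):
    """Map each URL to its CrawlDatum fields (Status, Signature, ...)."""
    records, fields = {}, None
    with open(path) as f:
        for raw in f:
            line = raw.rstrip("\n")
            # A line of the form "url<TAB>Version: 7" starts a new record;
            # indented metadata lines are not record starts.
            if "\t" in line and not line.startswith("\t"):
                url, _, line = line.partition("\t")
                fields = records.setdefault(url, {})
            if fields is None or not line.strip():
                continue
            key, sep, value = line.strip().partition(": ")
            if sep:
                fields[key] = value
    return records

old = parse_dump(sys.argv[1])   # dump from the previous crawl
new = parse_dump(sys.argv[2])   # dump from the current crawl

for url, fields in sorted(new.items()):
    status = fields.get("Status", "")
    prev = old.get(url)
    if any(s in status for s in ("db_gone", "db_redir_perm", "db_redir_temp")):
        print("GONE/MOVED", url)  # 404 -> db_gone, 301/302 -> db_redir_*
    elif prev is None:
        print("NEW       ", url)
    elif fields.get("Signature") != prev.get("Signature"):
        print("CHANGED   ", url)  # content digest differs since last crawl

for url in sorted(old.keys() - new.keys()):
    print("REMOVED   ", url)      # dropped from the CrawlDb entirely
```

I'd invoke it as something like `python3 diff_crawldb.py old_dump/part-00000 new_dump/part-00000` (the script name and part-file paths are just placeholders; the part-file naming depends on the Hadoop version). If I understand the CrawlDatum statuses correctly, pages that answer 404 end up as db_gone and 301/302 as db_redir_perm / db_redir_temp, so the same diff should catch removals too — though I'd love to hear if there's a cleaner built-in way.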

