I have configured a Nutch instance to continuously crawl a particular site. I
have successfully managed to get the website data in bulk and to post-process
that information.

The problem I'm facing now is that every time I run the crawling process I
have to post-process the whole site. I want to optimize the post-processing,
and to do so I need to get from Nutch the list of pages that have changed
since the last crawl was run.

What's the best way to do that? Is there already a mechanism in Nutch that
keeps track of the last time the content of a page changed? Or do I have to
create a registry of crawled pages, with an MD5 hash for example, and keep
track of changes myself? Aside from tracking which pages have changed, I also
need to track which pages have been removed since the last crawl. Is there a
specific mechanism to track removed pages (i.e. 404, 301, 302 HTTP status
codes)?
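
To make it concrete, something along these lines is what I mean by a
registry. This is only a minimal sketch: loadCurrentCrawl() is a stand-in
for however you extract URL -> content pairs from the crawl output, and the
tab-separated registry file format is just my own choice, nothing from
Nutch itself:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangeRegistry {

    // Hex-encoded MD5 of a page's raw content.
    static String md5(byte[] content) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(content);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Path registryFile = Paths.get("registry.tsv"); // one "URL \t md5" per line

        // Signatures from the previous run, if any.
        Map<String, String> previous = new HashMap<>();
        if (Files.exists(registryFile)) {
            for (String line : Files.readAllLines(registryFile)) {
                String[] parts = line.split("\t", 2);
                previous.put(parts[0], parts[1]);
            }
        }

        // Current run's pages (hypothetical helper, see note above).
        Map<String, byte[]> current = loadCurrentCrawl();

        List<String> changed = new ArrayList<>();
        List<String> removed = new ArrayList<>();
        Map<String, String> next = new HashMap<>();

        for (Map.Entry<String, byte[]> e : current.entrySet()) {
            String sig = md5(e.getValue());
            next.put(e.getKey(), sig);
            // New URLs and URLs whose content hash differs both count as changed.
            if (!sig.equals(previous.get(e.getKey()))) changed.add(e.getKey());
        }
        // URLs seen last time but absent now are candidates for deletion.
        for (String url : previous.keySet())
            if (!current.containsKey(url)) removed.add(url);

        // Persist the new signatures for the next run.
        try (BufferedWriter w = Files.newBufferedWriter(registryFile)) {
            for (Map.Entry<String, String> e : next.entrySet())
                w.write(e.getKey() + "\t" + e.getValue() + "\n");
        }

        System.out.println("changed: " + changed);
        System.out.println("removed: " + removed);
    }

    static Map<String, byte[]> loadCurrentCrawl() {
        // Hypothetical: plug in your own reader over the crawl output here.
        return Collections.emptyMap();
    }
}

One open question I see with this approach: a URL missing from the current
crawl isn't necessarily gone (it may simply not have been refetched this
round), which is why I'd still prefer a status-code-based mechanism from
Nutch itself if one exists.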

Any tips, ideas or shared experiences will be more than welcome, and I will
gladly share the lessons learnt once I have this running.

Hugs,
Julio 
