I have configured a *Nutch* instance to continuously crawl a particular
site. I have successfully managed to fetch the website data in bulk and
to post-process that information.

The problem I'm facing now is that every time I run the crawling process I
have to post-process the whole site. I want to optimize the post-processing,
and to do so I need to get from *Nutch* the list of pages that have changed
since the last crawl was run.

What's the best way to do that? Is there already a mechanism in *Nutch* that
keeps track of the last time the content of a page changed? Do I have to
create a registry of crawled pages, with an MD5 for example, and keep track
of changes myself? Aside from tracking which pages have changed, I also need
to track which pages have been removed since the last crawl. Is there a
specific mechanism for tracking removed pages (e.g. 404, 301, 302 HTTP
codes)?
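To make the registry idea concrete, here is the kind of diff I have in
mind. It's only a minimal sketch, not a working solution: it assumes a
Nutch 1.x CrawlDb stored as Hadoop sequence files under crawldb/current,
and it relies on the content signature Nutch already keeps in each
CrawlDatum (MD5 or TextProfile, depending on the configured Signature
implementation). The class name, snapshot file format, and paths are all
hypothetical.

import java.io.*;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/**
 * Diffs the current CrawlDb against a URL->signature snapshot saved after
 * the previous crawl, printing NEW / CHANGED / REMOVED URLs.
 * Hypothetical usage:
 *   java CrawlDbDiff crawl/crawldb/current/part-00000/data signapshot.txt
 */
public class CrawlDbDiff {
  public static void main(String[] args) throws IOException {
    Path crawlDbPart = new Path(args[0]); // e.g. crawldb/current/part-00000/data
    File snapshot = new File(args[1]);    // URL<TAB>signature from the last run

    // Load the previous URL -> signature map, if a snapshot exists.
    Map<String, String> previous = new HashMap<>();
    if (snapshot.exists()) {
      try (BufferedReader in = new BufferedReader(new FileReader(snapshot))) {
        String line;
        while ((line = in.readLine()) != null) {
          int tab = line.indexOf('\t');
          if (tab > 0) previous.put(line.substring(0, tab), line.substring(tab + 1));
        }
      }
    }

    Map<String, String> current = new HashMap<>();
    Configuration conf = new Configuration();
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(crawlDbPart))) {
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        byte status = datum.getStatus();
        if (status == CrawlDatum.STATUS_DB_GONE) {
          System.out.println("REMOVED\t" + url);      // e.g. permanent 404
          continue;
        }
        if (status != CrawlDatum.STATUS_DB_FETCHED
            && status != CrawlDatum.STATUS_DB_NOTMODIFIED) {
          continue;                                    // unfetched, redirects, etc.
        }
        String sig = toHex(datum.getSignature());
        current.put(url.toString(), sig);
        String old = previous.get(url.toString());
        if (old == null) {
          System.out.println("NEW\t" + url);
        } else if (!old.equals(sig)) {
          System.out.println("CHANGED\t" + url);
        }
      }
    }

    // URLs present last time but absent from the CrawlDb now.
    for (String url : previous.keySet()) {
      if (!current.containsKey(url)) System.out.println("REMOVED\t" + url);
    }

    // Persist the new snapshot for the next diff.
    try (PrintWriter out = new PrintWriter(new FileWriter(snapshot))) {
      for (Map.Entry<String, String> e : current.entrySet()) {
        out.println(e.getKey() + "\t" + e.getValue());
      }
    }
  }

  private static String toHex(byte[] sig) {
    if (sig == null) return "";
    StringBuilder sb = new StringBuilder();
    for (byte b : sig) sb.append(String.format("%02x", b));
    return sb.toString();
  }
}

I assume STATUS_DB_GONE covers pages whose fetch ended in a permanent 404,
but I'm less sure how the redirect statuses should best be treated, which
is part of what I'm asking.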

Any tips, ideas or shared experiences will be more than welcome, and I
will gladly share the lessons learnt once I have the thing running.

Hugs,
Julio

-- 
XNG | Julio Garcés Teuber
Email: [email protected]
Skype: julio.xng
Tel: +54 (11) 4777.9488
Fax: +1 (320) 514.4271
http://www.xinergia.com/
