Hi Julio,

The algorithm you are referring to is called the adaptive fetch interval in Nutch. There is some basic reading on this in nutch-default.xml, and there should also be a good deal of material in the user@ (as well as dev@) list archives. If you require more information on this then please say so; however, I'm sure you'll be able to suss it out.
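If it helps, here is a minimal nutch-site.xml sketch that switches from the default fixed-interval schedule to AdaptiveFetchSchedule. The property names are the ones documented in nutch-default.xml; the values are only illustrative and would need tuning for your site:

  <!-- switch the fetch schedule to the adaptive implementation -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- starting re-fetch interval in seconds (illustrative: one day) -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
  </property>
  <!-- bounds the adaptive schedule stays within (illustrative: 1 hour to 30 days) -->
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>3600</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>2592000</value>
  </property>

With this in place, pages that keep changing are re-fetched more often and stable pages less often, which is exactly the "has this page changed since last time" signal you are after.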
The information on your post-processing is quite vague and doesn't give much indication of exactly what data you need in order to streamline it, so for clarity could you expand on what you have provided? I know this sounds really basic, but you could dump your crawldb before and after a crawl and compare the two dumps (a sketch of the commands is at the very end of this message, below the quoted mail). That way you get the status of each URL in the crawldb, when it was last fetched, and whether it has been updated since the last crawl.

SolrClean will remove the various pages you mention, as a way of keeping your index an accurate representation of the web graph. However, I am not entirely sure we have a method for determining exactly which pages were removed; we do get log output telling us how many pages were removed.

On Wed, Jul 27, 2011 at 3:44 PM, Julio Garcés Teuber <[email protected]> wrote:

> I have configured a *Nutch* instance to continuously crawl a particular
> site. I have successfully managed to get the website data in bulk and,
> based on that, to post-process that information.
>
> The problem I'm facing now is that every time I run the crawling process
> I have to post-process the whole site. I want to optimize the
> post-processing, and in order to do so I need to get from *Nutch* the
> list of pages that have changed since the last crawling process was run.
>
> What's the best way to do that? Is there already a mechanism in *Nutch*
> that keeps track of the last time the content of a page has changed? Do
> I have to create a registry of crawled pages with an md5, for example,
> and keep track of changes myself? Aside from tracking which pages have
> changed, I also need to track which pages have been removed since the
> last crawl. Is there a specific mechanism to track removed pages (i.e.
> 404, 301, 302 HTTP codes)?
>
> Any tips, ideas or sharing of experiences will be more than welcome, and
> I will gladly share the lessons learnt once I have the thing running.
>
> Hugs,
> Julio
>
> --
> XNG | Julio Garcés Teuber
> Email: [email protected]
> Skype: julio.xng
> Tel: +54 (11) 4777.9488
> Fax: +1 (320) 514.4271
> http://www.xinergia.com/

--
*Lewis*
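As referenced above, a rough sketch of the before/after crawldb dump. The crawldb path, output directories and part-file name here are only examples and depend on your own crawl layout:

  # dump the crawldb before the next crawl cycle (plain-text record per URL:
  # status, fetch time, modified time, signature, ...)
  bin/nutch readdb crawl/crawldb -dump dump_before

  # ... run your generate/fetch/parse/updatedb cycle ...

  # dump it again afterwards
  bin/nutch readdb crawl/crawldb -dump dump_after

  # records that differ point at the URLs whose status, fetch time or
  # signature changed between the two runs
  diff -u dump_before/part-00000 dump_after/part-00000

URLs that disappear or flip to a gone/redirect status between the two dumps give you the removed-page list you asked about.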

