Hi Lewis! Sorry for the delay in getting back to you, but I was busy on other fronts. Now I'm back in full swing with the Nutch integration. Summarizing your tips, we have the following:
- In order to check which pages have changed, I can use the adaptive fetching
interval as a reference. I can find more on this subject in nutch-default.xml
and on the Nutch discussion lists.
- Another way to track changes would be to make dumps of the crawldb before
and after crawling and compare them (I have put a rough sketch of what I have
in mind at the very bottom of this mail, below the quoted thread).
- Finally, to find out which pages have been deleted, you recommend checking
the log. May I ask which log? Also, do you think the log has a detailed list
of the deleted pages or just the total count? Will this also remove the index
entries for the deleted pages?

Thank you once again for your help. I will do my homework on the first two
bullets and will highly appreciate more info on the third.

Cheers!
Julio.

On Wed, Jul 27, 2011 at 5:34 PM, lewis john mcgibbney <[email protected]> wrote:

> Hi Julio,
>
> The algorithm you are referring to is called the adaptive fetching interval
> in Nutch. There is some basic reading on this in nutch-default.xml and
> there should also be a good deal on the user@ list (as well as dev@). If you
> require more information on this then please say, however I'm sure you
> should be able to suss it out.
>
> The information on your post processing is quite vague and doesn't give
> much indication of exactly what data we need in order to streamline your
> post processing activity, so for clarity is it possible to expand upon what
> you provided?
>
> I know this sounds really basic, but you could do a dump of your crawldb
> before and after for comparison or similarity analysis. This way we would
> find the status of URLs in the crawldb as well as when they were last
> fetched and whether or not they have been updated since the last crawl.
>
> Solr clean will remove the various pages you mention, as a method for
> reflecting an accurate representation of the web graph in your index.
> However, I am not entirely sure if we have a method for determining exactly
> which pages were removed; we do get log output telling us how many pages
> were removed.
>
> On Wed, Jul 27, 2011 at 3:44 PM, Julio Garcés Teuber
> <[email protected]> wrote:
>
>> I have configured a *Nutch* instance to continuously crawl a particular
>> site. I have successfully managed to get the website data in bulk and,
>> based on that, to post process that information.
>>
>> The problem I'm facing now is that every time I run the crawling process
>> I have to post process the whole site. I want to optimize the post
>> processing, and in order to do so I need to get from *Nutch* the list of
>> pages that have changed since the last crawling process was run.
>>
>> What's the best way to do that? Is there already a mechanism in *Nutch*
>> that keeps track of the last time the content of a page has changed? Do I
>> have to create a registry of crawled pages with an md5, for example, and
>> keep track of changes myself? Aside from tracking which pages have
>> changed, I also need to track which pages have been removed since the
>> last crawl. Is there a specific mechanism to track removed pages (i.e.
>> 404, 301, 302 HTTP codes)?
>>
>> Any tips, ideas or sharing of experiences will be more than welcome and I
>> will gladly share the lessons learnt once I have the thing running.
>>
>> Hugs,
>> Julio
>>
>> --
>> XNG | Julio Garcés Teuber
>> Email: [email protected]
>> Skype: julio.xng
>> Tel: +54 (11) 4777.9488
>> Fax: +1 (320) 514.4271
>> http://www.xinergia.com/
>>
>
>
> --
> *Lewis*
>

--
XNG | Julio Garcés Teuber
Email: [email protected]
Skype: julio.xng
Tel: +54 (11) 4777.9488
Fax: +1 (320) 514.4271
http://www.xinergia.com/
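P.S. Here is the rough sketch of the dump comparison I mentioned above, just
so you can see what I have in mind. It is untested and nothing in it ships
with Nutch: I am assuming the two dumps are produced with something like
"bin/nutch readdb crawl/crawldb -dump dump_before" (and dump_after), that the
output lives in part-* files, and that each record starts with
"<url><TAB>Version: ..." followed by "Status: ..." and "Signature: ..." lines.
If the field labels are different in your Nutch version, the parsing will
need adjusting.

#!/usr/bin/env python
#
# compare_crawldb_dumps.py -- rough sketch, not part of Nutch itself.
#
# Assumes two text dumps made with something like:
#   bin/nutch readdb crawl/crawldb -dump dump_before   (before the crawl)
#   bin/nutch readdb crawl/crawldb -dump dump_after    (after the crawl)
# and that each record in the part-* files starts with "<url>\tVersion: ..."
# followed by "Status: ..." and "Signature: ..." lines. Treat the parsing
# below as an assumption and adjust it to what your dumps actually contain.

import glob
import os
import sys


def load_dump(dump_dir):
    """Return {url: (status, signature)} parsed from a crawldb text dump."""
    records = {}
    for path in glob.glob(os.path.join(dump_dir, "part-*")):
        url, status, signature = None, None, None
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if "\t" in line and line.split("\t", 1)[1].startswith("Version:"):
                    # First line of a new record, e.g. "http://site/page\tVersion: 7"
                    if url is not None:
                        records[url] = (status, signature)
                    url = line.split("\t", 1)[0]
                    status, signature = None, None
                elif line.startswith("Status:"):
                    status = line.split(":", 1)[1].strip()
                elif line.startswith("Signature:"):
                    signature = line.split(":", 1)[1].strip()
        if url is not None:
            records[url] = (status, signature)
    return records


def main(before_dir, after_dir):
    before = load_dump(before_dir)
    after = load_dump(after_dir)

    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    # "Changed" = present in both dumps but with a different content signature.
    changed = sorted(u for u in set(before) & set(after)
                     if before[u][1] != after[u][1])
    # Pages the crawldb now marks as gone (404s); redirects (301/302) show up
    # with "redir" in the status if you want to treat those as removed too.
    gone = sorted(u for u, (status, _) in after.items()
                  if status is not None and "gone" in status)

    for label, urls in (("ADDED", added), ("REMOVED", removed),
                        ("CHANGED", changed), ("GONE", gone)):
        for u in urls:
            print("%s\t%s" % (label, u))


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: compare_crawldb_dumps.py <dump_before_dir> <dump_after_dir>")
    main(sys.argv[1], sys.argv[2])

If something along these lines works out, the idea would be to re-post-process
only the CHANGED and ADDED pages and to drop the REMOVED/GONE ones from my own
data, instead of redoing the whole site every time. I'll report back on how it
goes.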

