On Friday 02 September 2011 15:06:59 Julio Garcés Teuber wrote:
> Hi Lewis!
>
> Sorry for the delay in coming back to you, but I was busy attacking
> other fronts. Now I'm back in full with the Nutch integration.
> Summarizing your tips, we have the following:
>
> - To check which pages have changed I can use the adaptive fetching
>   interval as a reference. I can find more on this subject in
>   nutch-default.xml and on the Nutch discussion lists.
> - Another way to track changes would be to make dumps of the crawldb
>   before and after crawling and compare them.
> - Finally, to find out which pages have been deleted, you recommend
>   checking the log.
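For the second bullet, the before-and-after comparison is usually just two
text dumps and a diff. A minimal sketch, with example paths and assuming the
default plain-text layout of the readdb dump (part file names can differ, so
check the output directory on your build):

  bin/nutch readdb crawl/crawldb -dump dump_before
  # ... run the next crawl cycle, then dump again ...
  bin/nutch readdb crawl/crawldb -dump dump_after
  # each record is one URL plus its status, fetch time and modified time,
  # so a plain diff already shows which entries changed between crawls
  diff -u dump_before/part-00000 dump_after/part-00000 | less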
Easiest method is to readdb -dump the crawldb and grep for db_gone.

> May I ask which log? Also, do you think the log has a detailed list
> of deleted pages or just the total count?

readdb -stats shows the sum of 404s.

> Will this also remove the indexes for deleted pages in Nutch?

The solrclean tool will do that for you.

> Thank you once again for your help. I will do my homework on the first
> two bullets and will highly appreciate more info on the third.
>
> Cheers!
> Julio.
>
> On Wed, Jul 27, 2011 at 5:34 PM, lewis john mcgibbney
> <[email protected]> wrote:
> > Hi Julio,
> >
> > The algorithm you are referring to is called the adaptive fetching
> > interval in Nutch. There is some basic reading on this in
> > nutch-default.xml, and there should also be a good deal on the user@
> > list (as well as dev@). If you require more information on this then
> > please say, although I'm sure you will be able to suss it out.
> >
> > The information on your post-processing is quite vague and doesn't
> > give much indication of exactly what data you need in order to
> > streamline your post-processing activity. For clarity, is it possible
> > to expand on what you provided?
> >
> > I know this sounds really basic, but you could do a dump of your
> > crawldb before and after crawling for comparison or similarity
> > analysis. This way you would find the status of the URLs in the
> > crawldb, as well as when they were last fetched and whether or not
> > they have been updated since the last crawl.
> >
> > Solr clean will remove the various pages you mention, as a method for
> > reflecting an accurate representation of the web graph in your index.
> > However, I am not entirely sure whether we have a method for
> > determining exactly which pages were removed, although we do get log
> > output telling us how many pages were removed.
> >
> > On Wed, Jul 27, 2011 at 3:44 PM, Julio Garcés Teuber
> > <[email protected]> wrote:
> >> I have configured a *Nutch* instance to continuously crawl a
> >> particular site. I have successfully managed to get the website data
> >> in bulk and, based on that, to post-process that information.
> >>
> >> The problem I'm facing now is that every time I run the crawling
> >> process I have to post-process the whole site. I want to optimize the
> >> post-processing, and to do so I need to get from *Nutch* the list of
> >> pages that have changed since the last crawling process was run.
> >>
> >> What's the best way to do that? Is there already a mechanism in
> >> *Nutch* that keeps track of the last time the content of a page
> >> changed? Do I have to create a registry of crawled pages, with an md5
> >> for example, and keep track of changes myself? Aside from tracking
> >> which pages have changed, I also need to track which pages have been
> >> removed since the last crawl. Is there a specific mechanism to track
> >> removed pages (i.e. 404, 301, 302 HTTP codes)?
> >>
> >> Any tips, ideas or sharing of experiences will be more than welcome,
> >> and I will gladly share the lessons learnt once I have the thing
> >> running.
> >>
> >> Hugs,
> >> Julio
> >>
> >> --
> >> XNG | Julio Garcés Teuber
> >> Email: [email protected]
> >> Skype: julio.xng
> >> Tel: +54 (11) 4777.9488
> >> Fax: +1 (320) 514.4271
> >> http://www.xinergia.com/
> >
> > --
> > *Lewis*

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
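For reference, the readdb/grep route described above for deleted pages comes
down to something like the following. The paths are only examples, and the
-B1 grep context assumes the URL sits on the line above the Status line in
the dump, which is worth verifying against your own output:

  # dump the crawldb as plain text
  bin/nutch readdb crawl/crawldb -dump gone_dump
  # list the entries whose status is db_gone (pages that have been removed)
  grep -B1 'db_gone' gone_dump/part-00000
  # per-status totals, including the count of gone/404 pages
  bin/nutch readdb crawl/crawldb -stats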

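The solrclean step is a single call along these lines, assuming your Nutch
build ships the SolrClean job mentioned above (older releases may not have
it, so check the command list printed by bin/nutch and adjust the Solr URL
to your setup):

  # purge documents for db_gone URLs from the Solr index
  bin/nutch solrclean crawl/crawldb http://localhost:8983/solr/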

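On the adaptive fetching interval that both the first bullet and Lewis refer
to: the relevant properties are documented in conf/nutch-default.xml and are
meant to be overridden in nutch-site.xml. The exact names vary a little
between versions, so the quickest check is simply:

  # list the fetch schedule / fetch interval properties and their defaults
  grep -A2 'db.fetch.schedule' conf/nutch-default.xml
  grep -A2 'db.fetch.interval' conf/nutch-default.xml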