Thanks a lot! I really appreciate your answers.

On Fri, Sep 2, 2011 at 10:25 AM, Markus Jelsma <[email protected]> wrote:
> On Friday 02 September 2011 15:06:59 Julio Garcés Teuber wrote:
> > Hi Lewis!
> >
> > Sorry for the delay in coming back to you but I was busy attacking
> > other fronts. Now I'm back in full with Nutch integration.
> > Summarizing your tips we have the following:
> >
> > - In order to check which pages have changed I can use the adaptive
> >   fetching interval as reference. I can find more on this subject in
> >   nutch-default.xml and the Nutch discussion lists.
> > - Another way to track changes would be to make dumps before and
> >   after crawling.
> > - Finally, to find out which pages have been deleted you recommend
> >   checking the log.
>
> Easiest method is to readdb -dump the crawldb and grep for db_gone.
>
> > May I ask which log? Also, do you think the log has a detailed list
> > of deleted pages or just the total count?
>
> readdb -stats shows the sum of 404s.
>
> > Will this also remove the indexes for deleted pages on Nutch?
>
> Solrclean tool will do that for you.
>
> > Thank you once again for your help. I will do my homework with the
> > first two bullets and will highly appreciate more info on the third.
> >
> > Cheers!
> > Julio.
> >
> > On Wed, Jul 27, 2011 at 5:34 PM, lewis john mcgibbney
> > <[email protected]> wrote:
> > > Hi Julio,
> > >
> > > The algorithm you are referring to is called the adaptive fetching
> > > interval in Nutch. There is some basic reading on this in
> > > nutch-default.xml and there should also be a good deal on the user@
> > > list (as well as dev@). If you require more information on this
> > > then please say, however I'm sure you should be able to suss it out.
> > >
> > > Information on your post processing is quite vague and doesn't give
> > > much indication of exactly what data we need in order to streamline
> > > your post processing activity. For clarity, is it possible to expand
> > > upon what you provided?
> > >
> > > I know this sounds really basic, but you could do a dump of your
> > > crawldb before and after for comparison or similarity analysis.
> > > This way we would find the status of URLs in crawldb as well as
> > > when they were last fetched and whether or not they have been
> > > updated since the last crawl.
> > >
> > > Solr clean will remove the pages you mention, so that your index
> > > accurately reflects the web graph. However, again, I am not
> > > entirely sure we have a method for determining exactly which pages
> > > were removed, although we do get log output telling us how many
> > > pages were removed.
> > >
> > > On Wed, Jul 27, 2011 at 3:44 PM, Julio Garcés Teuber
> > > <[email protected]> wrote:
> > >> I have configured a *Nutch* instance to continuously crawl a
> > >> particular site. I have successfully managed to get the website
> > >> data in bulk and, based on that, to post process that information.
> > >>
> > >> The problem I'm facing now is that every time I run the crawling
> > >> process I have to post process the whole site. I want to optimize
> > >> the post processing and in order to do so I need to get from
> > >> *Nutch* the list of pages that have changed since the last
> > >> crawling process was run.
> > >>
> > >> What's the best way to do that? Is there already a mechanism in
> > >> *Nutch* that keeps track of the last time the content of a page
> > >> has changed? Do I have to create a registry of crawled pages with
> > >> an md5, for example, and keep track of changes myself? Aside from
> > >> tracking which pages have changed I also need to track which pages
> > >> have been removed since the last crawl. Is there a specific
> > >> mechanism to track removed pages (i.e. 404, 301, 302 HTTP codes)?
> > >>
> > >> Any tips, ideas or sharing of experiences will be more than
> > >> welcome and I will gladly share the lessons learnt once I have the
> > >> thing running.
> > >>
> > >> Hugs,
> > >> Julio
> > >>
> > >> --
> > >> XNG | Julio Garcés Teuber
> > >> Email: [email protected]
> > >> Skype: julio.xng
> > >> Tel: +54 (11) 4777.9488
> > >> Fax: +1 (320) 514.4271
> > >> http://www.xinergia.com/
> > >
> > > --
> > > *Lewis*
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350

--
XNG | Julio Garcés Teuber
Email: [email protected]
Skype: julio.xng
Tel: +54 (11) 4777.9488
Fax: +1 (320) 514.4271
http://www.xinergia.com/
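
For reference, the "dump the crawldb before and after" idea discussed in this thread could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not an official Nutch tool: it assumes the text output of readdb -dump has one record per URL, starting with a "URL<TAB>Version: N" line followed by "Key: value" lines such as "Status: 3 (db_gone)" and "Signature: ...". Check that against the dump your Nutch version actually writes. The function names parse_dump and diff_dumps are made up for this example.

```python
def parse_dump(text):
    """Parse crawldb dump text into {url: {field: value}}.

    Assumed layout (verify against your own dump): each record starts
    with "URL<TAB>Version: N" and continues with "Key: value" lines.
    """
    records = {}
    current = None
    for line in text.splitlines():
        if "\tVersion:" in line:
            # First line of a record: the URL comes before the tab.
            url = line.split("\t", 1)[0]
            current = {}
            records[url] = current
        elif current is not None and ":" in line:
            # Split on the first colon only, so time values stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def diff_dumps(before, after):
    """Return (changed, gone, new) URL lists between two parsed dumps."""
    # Changed: present in both dumps but with a different signature.
    changed = sorted(
        url for url in after
        if url in before
        and after[url].get("Signature") != before[url].get("Signature")
    )
    # Gone: marked db_gone in the newer dump.
    gone = sorted(
        url for url, rec in after.items()
        if "db_gone" in rec.get("Status", "")
    )
    # New: only present in the newer dump.
    new = sorted(url for url in after if url not in before)
    return changed, gone, new
```

To use it, one would run readdb -dump before and after a crawl, read the resulting text files (the exact file names inside the dump directory depend on the setup), pass their contents to parse_dump, and hand both results to diff_dumps. Only the URLs in the changed and new lists would then need post processing again.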

