Re: Page deletion and tracking change between crawlings

Julio Garcés Teuber Fri, 02 Sep 2011 07:17:31 -0700

Thanks a lot! I really appreciate your answers.

On Fri, Sep 2, 2011 at 10:25 AM, Markus Jelsma
<[email protected]> wrote:


>
>
> On Friday 02 September 2011 15:06:59 Julio Garcés Teuber wrote:
> > Hi Lewis!
> >
> > Sorry for the delay in coming back to you but I was busy attacking other
> > fronts. Now I'm back in full with Nutch integration. Summarizing your tips,
> > we have the following:
> >
> > - In order to check which pages have changed I can use the adaptive
> > fetching interval as a reference. I can find more on this subject in
> > nutch-default.xml and on the Nutch discussion lists.
> > - Another way to track changes would be to make dumps before and after
> > crawling.
> > - Finally, to find out which pages have been deleted, you recommend
> > checking the log.
>
> The easiest method is to readdb -dump the crawldb and grep for db_gone.
>
> > May I ask which log? Also, do you think the log has a detailed list
> > of deleted pages or just the total count?
>
> readdb -stats shows the sum of 404s.
>
> > Will this also remove the index entries
> > for deleted pages in Nutch?
>
> The solrclean tool will do that for you.
> >
> > Thank you once again for your help. I will do my homework with the first
> > two bullets and will highly appreciate more info on the third.
> >
> > Cheers!
> > Julio.
> >
> > On Wed, Jul 27, 2011 at 5:34 PM, lewis john mcgibbney
> > <[email protected]> wrote:
> > > Hi Julio,
> > >
> > > The algorithm you are referring to is called the adaptive fetching
> > > interval in Nutch. There is some basic reading on this in
> > > nutch-default.xml and there should also be a good deal on the user@ list
> > > (as well as dev@). If you require more information on this then please
> > > say so; however, I'm sure you should be able to suss it out.
> > >
> > > Information on your post processing is quite vague and doesn't give much
> > > indication of exactly what data we need in order to streamline your post
> > > processing activity. For clarity, is it possible to expand upon what you
> > > provided?
> > >
> > > I know this sounds really basic, but you could do a dump of your crawldb
> > > before and after for comparison or similarity analysis. This way we would
> > > find the status of URLs in crawldb as well as when they were last fetched
> > > and whether or not they have been updated since the last crawl.
> > >
> > > Solr clean will remove the various pages you mention, as a method for
> > > reflecting an accurate representation of the web graph in your index.
> > > However, I am not entirely sure if we have a method for determining
> > > exactly which pages were removed, although we do get log output telling
> > > us how many pages were removed.
> > >
> > > On Wed, Jul 27, 2011 at 3:44 PM, Julio Garcés Teuber
> > > <[email protected]> wrote:
> > >> I have configured a *Nutch* instance to continuously crawl a particular
> > >> site. I have successfully managed to get the website data in bulk and,
> > >> based on that, to post process that information.
> > >>
> > >> The problem I'm facing now is that every time I run the crawling process
> > >> I have to post process the whole site. I want to optimize the post
> > >> processing and in order to do so I need to get from *Nutch* the list of
> > >> pages that have changed since the last crawling process was run.
> > >>
> > >> What's the best way to do that? Is there already a mechanism in
> > >> *Nutch* that keeps track of the last time the content of a page has
> > >> changed? Do I have to create a registry of crawled pages with an md5,
> > >> for example, and keep track of changes myself? Aside from tracking which
> > >> pages have changed, I also need to track which pages have been removed
> > >> since the last crawl. Is there a specific mechanism to track removed
> > >> pages (i.e. 404, 301, 302 HTTP codes)?
> > >>
> > >> Any tips, ideas or sharing of experiences will be more than welcome and
> > >> I will gladly share the lessons learnt once I have the thing running.
> > >>
> > >> Hugs,
> > >> Julio
> > >>
> > >> --
> > >> XNG | Julio Garcés Teuber
> > >> Email: [email protected]
> > >> Skype: julio.xng
> > >> Tel: +54 (11) 4777.9488
> > >> Fax: +1 (320) 514.4271
> > >> http://www.xinergia.com/
> > >
> > > --
> > > *Lewis*
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
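
For reference, the dump-and-compare approach quoted above can be sketched in Python. This is only a minimal sketch and not part of Nutch. It assumes the crawldb was dumped to text with readdb -dump before and after a crawl, that each record starts with the URL followed by a tab, and that records carry "Status: N (db_xxx)" and "Signature:" lines and are separated by blank lines. The names parse_dump and compare_dumps and the sample data are hypothetical; please check the assumptions against the dump produced by your Nutch version.

```python
import re

# Assumed record layout in a text dump (an assumption, verify locally):
#   http://example.com/page<TAB>Version: 7
#   Status: 3 (db_gone)
#   Signature: 0123abcd
STATUS_RE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)", re.MULTILINE)
SIGNATURE_RE = re.compile(r"^Signature:[ \t]*(\S*)", re.MULTILINE)

# Statuses treated here as "page was fetched successfully".
FETCHED = {"db_fetched", "db_notmodified"}


def parse_dump(text):
    """Map each URL in a dump to a (status, signature) pair."""
    records = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        first_line = block.split("\n", 1)[0]
        url = first_line.split("\t", 1)[0].strip()
        status = STATUS_RE.search(block)
        if not url or status is None:
            continue  # skip anything that does not look like a record
        signature = SIGNATURE_RE.search(block)
        records[url] = (status.group(1), signature.group(1) if signature else "")
    return records


def compare_dumps(before, after):
    """Return (changed, gone) URL lists between two parsed dumps."""
    changed, gone = [], []
    for url, (status, signature) in sorted(after.items()):
        if status == "db_gone":
            # Report only pages that were not already gone before this crawl.
            if before.get(url, (None, None))[0] != "db_gone":
                gone.append(url)
        elif status in FETCHED:
            # New page, or content signature differs from the previous dump.
            if url not in before or before[url][1] != signature:
                changed.append(url)
    return changed, gone


# Made-up sample data, for illustration only.
SAMPLE_BEFORE = (
    "http://example.com/a\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: aaa\n\n"
    "http://example.com/b\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: bbb\n\n"
    "http://example.com/c\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: ccc\n"
)
SAMPLE_AFTER = (
    "http://example.com/a\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: aaa\n\n"
    "http://example.com/b\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: bbb2\n\n"
    "http://example.com/c\tVersion: 7\nStatus: 3 (db_gone)\nSignature: ccc\n\n"
    "http://example.com/d\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: ddd\n"
)

changed, gone = compare_dumps(parse_dump(SAMPLE_BEFORE), parse_dump(SAMPLE_AFTER))
print("changed:", changed)
print("gone:", gone)
```

Comparing signatures rather than fetch times means only pages whose content actually changed get post-processed again. The list of gone pages is what you would feed to your own cleanup step.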



-- 
XNG | Julio Garcés Teuber
Email: [email protected]
Skype: julio.xng
Tel: +54 (11) 4777.9488
Fax: +1 (320) 514.4271
http://www.xinergia.com/
