Hi Julio,

The algorithm you are referring to is called the adaptive fetch interval in Nutch. There is some basic reading on this in nutch-default.xml, and there should also be a good deal of material in the user@ (as well as dev@) list archives. If you require more information on this then please say so; however, I'm sure you'll be able to suss it out.
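If it helps, here is a minimal nutch-site.xml sketch that switches from the default fixed-interval schedule to AdaptiveFetchSchedule. The property names are the ones documented in nutch-default.xml; the values are only illustrative and would need tuning for your site:

  <!-- switch the fetch schedule to the adaptive implementation -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- starting re-fetch interval in seconds (illustrative: one day) -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
  </property>
  <!-- bounds the adaptive schedule stays within (illustrative: 1 hour to 30 days) -->
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>3600</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>2592000</value>
  </property>

With this in place, pages that keep changing are re-fetched more often and stable pages less often, which is exactly the "has this page changed since last time" signal you are after.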
The information on your post-processing is quite vague and doesn't give much indication of exactly what data you need in order to streamline it, so for clarity could you expand on what you have provided? I know this sounds really basic, but you could dump your crawldb before and after a crawl and compare the two dumps (a sketch of the commands is at the very end of this message, below the quoted mail). That way you get the status of each URL in the crawldb, when it was last fetched, and whether it has been updated since the last crawl.

SolrClean will remove the various pages you mention, as a way of keeping your index an accurate representation of the web graph. However, I am not entirely sure we have a method for determining exactly which pages were removed; we do get log output telling us how many pages were removed.

On Wed, Jul 27, 2011 at 3:44 PM, Julio Garcés Teuber <[email protected]> wrote:

> I have configured a *Nutch* instance to continuously crawl a particular
> site. I have successfully managed to get the website data in bulk and,
> based on that, to post-process that information.
>
> The problem I'm facing now is that every time I run the crawling process
> I have to post-process the whole site. I want to optimize the
> post-processing, and in order to do so I need to get from *Nutch* the
> list of pages that have changed since the last crawling process was run.
>
> What's the best way to do that? Is there already a mechanism in *Nutch*
> that keeps track of the last time the content of a page has changed? Do
> I have to create a registry of crawled pages with an md5, for example,
> and keep track of changes myself? Aside from tracking which pages have
> changed, I also need to track which pages have been removed since the
> last crawl. Is there a specific mechanism to track removed pages (i.e.
> 404, 301, 302 HTTP codes)?
>
> Any tips, ideas or sharing of experiences will be more than welcome, and
> I will gladly share the lessons learnt once I have the thing running.
>
> Hugs,
> Julio
>
> --
> XNG | Julio Garcés Teuber
> Email: [email protected]
> Skype: julio.xng
> Tel: +54 (11) 4777.9488
> Fax: +1 (320) 514.4271
> http://www.xinergia.com/

--
*Lewis*
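As referenced above, a rough sketch of the before/after crawldb dump. The crawldb path, output directories and part-file name here are only examples and depend on your own crawl layout:

  # dump the crawldb before the next crawl cycle (plain-text record per URL:
  # status, fetch time, modified time, signature, ...)
  bin/nutch readdb crawl/crawldb -dump dump_before

  # ... run your generate/fetch/parse/updatedb cycle ...

  # dump it again afterwards
  bin/nutch readdb crawl/crawldb -dump dump_after

  # records that differ point at the URLs whose status, fetch time or
  # signature changed between the two runs
  diff -u dump_before/part-00000 dump_after/part-00000

URLs that disappear or flip to a gone/redirect status between the two dumps give you the removed-page list you asked about.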

