[Nutch-dev] Incremental Crawling / Revisting Pages

Jack Tang Tue, 13 Sep 2005 00:48:01 -0700

Hi

There is a wonderful discussion on the Heritrix mailing list. I cannot help
forwarding some of it here, and I hope it is useful for Nutch.


---------------------------------------------------------------------------------------------------------
Dennis Hotson wrote:

> I'm just wondering whether anyone has written a filter or module to do
> incremental crawling.

Have you seen the AdaptiveRevisitingFrontier?  It's described in
outline here, http://crawler.archive.org/articles/user_manual.html#arf,
and in detail here: http://vefsofnun.bok.hi.is/thesis/ar.pdf.

> What I mean is something that will do a HEAD request on pages and then
> only fetch the actual content if the page has been updated (newer last-
> modified date or similar). This technique saves a lot of bandwidth and
> can speed up crawling for sites that aren't updated very often.
>
> I've written a proof of concept filter class that does this (well
> actually, it's not quite working yet).

How does your filter work?

St.Ack

>
> If somebody else has already solved this problem it would save me a lot
> of effort. Thanks! :D
>
> Cheers,
> Dennis
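
For readers who want something concrete, here is a rough sketch of the
HEAD-then-GET idea Dennis describes above. It is not his filter and it does
not use any Heritrix or Nutch API; it is plain Python, and the function names
(needs_refetch, fetch_if_updated) are made up for illustration. It assumes the
server sends a usable Last-Modified header and falls back to fetching whenever
that header is missing or unparseable.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def needs_refetch(last_modified_header, last_fetch_time):
    """Decide whether a page should be downloaded again.

    last_modified_header: raw Last-Modified value from a HEAD response, or None.
    last_fetch_time: timezone-aware datetime of the previous fetch, or None.
    """
    if last_fetch_time is None or not last_modified_header:
        return True  # nothing to compare against, so fetch to be safe
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return True  # unparseable date, so fetch to be safe
    if modified.tzinfo is None:
        # Treat a date without a usable zone as UTC (an assumption).
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_fetch_time


def fetch_if_updated(url, last_fetch_time, timeout=10):
    """Issue a HEAD request; download the body only if the page looks newer.

    Returns the body as bytes, or None when the page appears unchanged.
    """
    head = Request(url, method="HEAD")
    with urlopen(head, timeout=timeout) as resp:
        last_modified = resp.headers.get("Last-Modified")
    if not needs_refetch(last_modified, last_fetch_time):
        return None
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

Note that this costs two round trips for every page that did change. A
conditional GET with an If-Modified-Since header is a different technique
that gets the same saving in a single request, since the server answers
304 Not Modified with no body when nothing changed.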

Regards
/Jack


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars


