nutch crawl - incremental update

Bonardo Pascal Mon, 12 Mar 2007 17:08:18 -0800

Hi,

I'm currently working with the Nutch 0.8.1 crawler for a university project. I need to crawl an intranet website completely.


Problem: a full crawl takes time and resources.

I have read many things on this mailing list about incremental crawling, and I confess that I don't understand everything.

At this link ( http://wiki.apache.org/nutch/Automating_Fetches_with_Python ) I see a Python script for incremental updates, but it is based on the DB_unfetched flag. As I crawl ALL of my intranet site, I shouldn't have any unfetched pages, and so this script should do nothing. Am I wrong?

At this link ( http://issues.apache.org/jira/browse/NUTCH-61 ) I see a patch for the 0.8.1 release of Nutch which allows crawling only updated pages, but I don't understand at all how it works or how to use the Nutch crawler after applying this patch. In addition, this patch is announced as untested and unstable.

So the question is: is it currently possible to use the Nutch crawler to make a crawl which only downloads and indexes pages (HTML, PDF, DOC, etc.) which have been updated since the last crawl (based on the HTTP protocol), and if so, how?
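
To make clear what I mean by "based on the HTTP protocol", here is a minimal sketch in Python of a conditional GET using the If-Modified-Since header. This is not Nutch code; the function names and the example URL are hypothetical, and it assumes the intranet server answers 304 Not Modified for unchanged pages.

```python
from email.utils import formatdate
import urllib.error
import urllib.request


def conditional_request(url, last_fetch_epoch):
    """Build a GET request that lets the server skip unchanged pages.

    last_fetch_epoch is the time of the previous crawl, in seconds
    since the Unix epoch (hypothetical bookkeeping, not a Nutch field).
    """
    req = urllib.request.Request(url)
    # If-Modified-Since expects an HTTP-date, which is always in GMT.
    req.add_header("If-Modified-Since",
                   formatdate(last_fetch_epoch, usegmt=True))
    return req


def fetch_if_modified(url, last_fetch_epoch):
    """Return the page body if it changed since the last crawl, else None."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_fetch_epoch)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: the previously indexed copy is still valid.
            return None
        raise
```

For example, fetch_if_modified("http://intranet.example/index.html", last_crawl_time) would return None for an unchanged page, so only modified pages would need to be parsed and indexed again.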


Thanks,

Bonardo Pascal
