Hi,
I'm currently working with the Nutch 0.8.1 crawler for a university project.
I need to crawl an entire intranet website.
Problem: a full crawl takes a lot of time and resources.
I have read many things on this mailing list about incremental crawling,
and I confess that I don't understand everything.
At this link (
http://wiki.apache.org/nutch/Automating_Fetches_with_Python ) I see a
Python script for incremental updates, but it is based on the DB_unfetched
flag. As I crawl ALL of my intranet site, I shouldn't have any unfetched
pages, and so this script should do nothing. Am I wrong?
At this link ( http://issues.apache.org/jira/browse/NUTCH-61 ) I see a
patch for the 0.8.1 release of Nutch which allows crawling only updated
pages, but I don't understand at all how it works or how to use the Nutch
crawler after applying this patch. In addition, this patch is announced as
untested and unstable.
So the question is: is it currently possible to use the Nutch crawler to
run a crawl which only downloads and indexes pages (HTML, PDF, DOC, etc.)
which have been updated since the last crawl (based on the HTTP protocol),
and if so, how?
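
To make clear what I mean by "based on the HTTP protocol", here is a
minimal Python sketch of a conditional GET. This is not Nutch code: the
function names are made up for illustration, and it assumes the intranet
server honors the If-Modified-Since and If-None-Match headers.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def conditional_headers(last_fetch_epoch, etag=None):
    """Build the request headers for a conditional GET (hypothetical helper)."""
    # HTTP dates must be in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    headers = {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}
    if etag:
        # If the server gave us an ETag last time, send it back as well.
        headers["If-None-Match"] = etag
    return headers


def page_changed(status_code):
    """304 Not Modified means the stored copy is still current."""
    return status_code != 304


def fetch_if_modified(url, last_fetch_epoch, etag=None):
    """Return (status, body); body is None when the page is unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(last_fetch_epoch, etag)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "unchanged".
        if err.code == 304:
            return 304, None
        raise
```

With something like this, only the pages for which page_changed() is true
would need to be downloaded and indexed again.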
Thanks,
Bonardo Pascal