Subhojit Roy wrote:
Hi,
Would it be possible to add to Nutch the ability to crawl and download a
page only if the page has been updated since the last crawl? I read
some time ago that there were plans to include such a feature. It would be a
very useful feature to have, IMO. This of course depends on the "last
modified" timestamp being present on the webpage that is being crawled,
which I believe is not mandatory. Still, those who do set it would benefit.

This is already implemented - see the Signature, MD5Signature and
TextProfileSignature classes.
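
To illustrate the general idea behind a content signature, here is a minimal standalone sketch. It is not Nutch's actual API: the class name ContentSignatureSketch and the methods md5 and hasChanged are hypothetical, and it assumes the signature is simply an MD5 digest of the raw page bytes. Note that a check like this detects a change only after the page has been fetched, by comparing the new digest with the one stored from the previous crawl.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration only; these names are not part of Nutch.
final class ContentSignatureSketch {

    // Compute an MD5 digest of the raw page bytes (always 16 bytes long).
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // A page counts as changed if no signature was stored for it yet,
    // or if the stored signature differs from the freshly computed one.
    static boolean hasChanged(byte[] previousSignature, byte[] fetchedContent) {
        if (previousSignature == null) {
            return true;
        }
        return !Arrays.equals(previousSignature, md5(fetchedContent));
    }

    public static void main(String[] args) {
        byte[] firstFetch = "<html>version 1</html>".getBytes(StandardCharsets.UTF_8);
        byte[] secondFetch = "<html>version 2</html>".getBytes(StandardCharsets.UTF_8);

        // Signature stored after the first crawl.
        byte[] stored = md5(firstFetch);

        System.out.println(hasChanged(stored, firstFetch));  // same bytes, so false
        System.out.println(hasChanged(stored, secondFetch)); // different bytes, so true
    }
}
```

An exact digest like this treats any byte-level difference as a change, so a page whose only difference is a rotating banner or a timestamp would count as updated. A signature computed from a profile of the page text rather than from the raw bytes is one way to tolerate such minor differences.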
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com