Michael Ji wrote:
Hi Jon:
You have an interesting approach.
We are making a similar effort to avoid unnecessary
indexing and data duplication for pages whose content
has not changed since the last successful fetch.
I am thinking of adding an extra data field to the
"fetchlist" data structure, which would contain the MD5
hash of the content from the previous fetch.
If the current fetch returns the same content, I will
skip the parsing and indexing steps.
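
The idea above could be sketched roughly as follows. This is only an illustration built on the standard java.security.MessageDigest class, not code from Nutch or from any patch; the class name FetchSkipSketch and the methods md5 and isUnchanged are hypothetical, and the sketch assumes the previous digest is available as a byte array carried with the fetchlist entry.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: decides whether a freshly
// fetched page can skip parsing and indexing.
final class FetchSkipSketch {

    // Compute the MD5 digest of the raw page content.
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // previousDigest is assumed to be the extra field stored with the
    // fetchlist entry; null means the page has never been fetched before.
    static boolean isUnchanged(byte[] previousDigest, byte[] fetchedContent) {
        if (previousDigest == null) {
            return false;
        }
        return MessageDigest.isEqual(previousDigest, md5(fetchedContent));
    }
}
```

A fetcher would call isUnchanged after downloading a page and skip the parse and index steps when it returns true. Keeping only the 16-byte digest, rather than the old content, keeps the extra fetchlist field small.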
Please see the patches in http://issues.apache.org/jira/browse/NUTCH-61 .
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com