Michael Ji wrote:
Hi Jon:

You have an interesting approach.
We are making a similar effort to avoid unnecessary
indexing and data duplication for pages whose content
has not changed since the last successful fetch.
I am thinking of adding an extra field to the
"fetchlist" data structure that holds the MD5 hash
of the content from the previous fetch.

If the current fetch returns the same content, I will
skip the parsing and indexing steps.
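
A minimal sketch of that check in Java, under stated assumptions: the
names ContentSignature, FetchEntry, previousMd5 and isUnchanged are
hypothetical and are not part of Nutch's fetchlist API. Only
java.security.MessageDigest is a real library call here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentSignature {

    // Hypothetical fetchlist entry: the URL plus the MD5 of the content
    // seen at the previous successful fetch (null if never fetched).
    static final class FetchEntry {
        final String url;
        final String previousMd5;

        FetchEntry(String url, String previousMd5) {
            this.url = url;
            this.previousMd5 = previousMd5;
        }
    }

    // Returns the MD5 digest of the raw content as lowercase hex.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True when the newly fetched bytes hash to the stored value, in which
    // case parsing and indexing could be skipped for this entry.
    static boolean isUnchanged(FetchEntry entry, byte[] fetched) {
        return entry.previousMd5 != null
                && entry.previousMd5.equals(md5Hex(fetched));
    }

    public static void main(String[] args) {
        byte[] oldPage = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        byte[] newPage = "<html>edited page</html>".getBytes(StandardCharsets.UTF_8);

        FetchEntry entry = new FetchEntry("http://example.com/", md5Hex(oldPage));

        System.out.println("unchanged content, skip: " + isUnchanged(entry, oldPage));
        System.out.println("changed content, skip:   " + isUnchanged(entry, newPage));
    }
}
```

An entry with no stored hash is treated as changed, so a page that has
never been fetched is always parsed and indexed. The comparison is
byte-exact, so a page that differs only in a timestamp or an ad block
would still be reprocessed.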

Please see the patches in http://issues.apache.org/jira/browse/NUTCH-61 .


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
