Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Andrzej Bialecki Thu, 25 Aug 2005 00:08:53 -0700

Michael Ji wrote:

Hi Jon:

You have an interesting approach.

We are working on a similar effort to avoid unnecessary
indexing and data duplication for pages whose content
has not changed since the last successful fetch.

I am thinking of adding an extra data field to the
"fetchlist" data structure, which would contain the MD5
hash of the content from the previous fetch.

If the current fetch returns the same content, I will
skip the parsing and indexing steps.
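
The check described above could be sketched roughly as follows. This is
only an illustration in plain Java: the class and method names
(ContentHashCheck, md5Hex, isUnchanged) are hypothetical and are not
Nutch APIs, and the previous hash is assumed to be stored as a hex
string alongside the fetchlist entry. The NUTCH-61 patches may do this
differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: compares the MD5 of freshly
// fetched content with the hash recorded at the previous fetch.
class ContentHashCheck {

    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True when a previous hash exists and matches the new content,
    // i.e. parsing and indexing could be skipped for this page.
    static boolean isUnchanged(String previousHash, byte[] content) {
        return previousHash != null && previousHash.equals(md5Hex(content));
    }

    public static void main(String[] args) {
        byte[] firstFetch = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        String storedHash = md5Hex(firstFetch); // assumed to be kept with the fetchlist entry

        byte[] secondFetch = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        if (isUnchanged(storedHash, secondFetch)) {
            System.out.println("unchanged: skip parse and index");
        } else {
            System.out.println("changed: parse and index");
        }
    }
}
```

A page with no stored hash (first fetch) is treated as changed, so it
is always parsed and indexed once.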


Please see the patches in http://issues.apache.org/jira/browse/NUTCH-61 .


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
