Dennis Kubes wrote:
If you are using more than one index then dedup will not work across indexes.

This is incorrect. DeleteDuplicates works just fine with multiple indexes, assuming you process all indexes in the same run of DeleteDuplicates, so that it has a global view of all input indexes.
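To illustrate why the global view matters, here is a minimal sketch in Python (not Nutch's Java code, and not the actual DeleteDuplicates logic). The `dedup` function, the index names, and the example URLs are hypothetical, and the "keep the first copy seen" rule is a simplification made for this example.

```python
import hashlib

def signature(content: bytes) -> str:
    # Exact content hash; any stable hash would do for this illustration.
    return hashlib.md5(content).hexdigest()

def dedup(indexes):
    """Return the (index_name, url) entries kept after removing duplicates.

    `indexes` maps an index name to a list of (url, content) pairs.
    All indexes passed in one call share a single `seen` set, which is
    the "global view" referred to above.
    """
    seen = set()
    kept = []
    for name in sorted(indexes):
        for url, content in indexes[name]:
            sig = signature(content)
            if sig in seen:
                continue  # duplicate of something already kept
            seen.add(sig)
            kept.append((name, url))
    return kept

index_a = [("http://a.example/1", b"same page")]
index_b = [("http://b.example/1", b"same page")]

# Two separate runs: neither run can see the other index's content.
separate = dedup({"a": index_a}) + dedup({"b": index_b})

# One run over both indexes: the cross-index duplicate is detected.
together = dedup({"a": index_a, "b": index_b})

print(len(separate))  # 2
print(len(together))  # 1
```

Run separately, each index keeps its own copy of the page; run together, only one copy survives.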

A single index should dedup correctly unless the pages are near duplicates rather than exact duplicates. The dedup process works on URL and byte hash. If the content differs by even 1 byte, the pages are not detected as duplicates.

This depends on the implementation of Signature. Indeed, the default MD5Signature works this way.
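As a rough illustration of that behaviour, the following Python sketch hashes raw page bytes with MD5. It is not the Nutch class itself; `md5_signature` and the sample pages are made up for this example.

```python
import hashlib

def md5_signature(content: bytes) -> str:
    # Hash the raw bytes; no normalisation of any kind is applied.
    return hashlib.md5(content).hexdigest()

page = b"<html><body>Hello, Nutch!</body></html>"
copy = bytes(page)                  # byte-for-byte identical
almost = page.replace(b"!", b"?")   # differs in exactly one byte

print(md5_signature(page) == md5_signature(copy))    # True
print(md5_signature(page) == md5_signature(almost))  # False
```

With a signature of this kind, a single differing byte is enough for two pages to be treated as distinct.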


Near duplicate detection is another set of algorithms that haven't been implemented in Nutch yet.

Well, the existing TextProfileSignature can be used as a form of (crude) near-duplicate detection, precisely because it is tolerant to small changes in the input text.
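A minimal Python sketch of the general idea follows: hash a quantised term-frequency profile instead of the raw bytes. It is only loosely modelled on the approach and is not the Nutch implementation; `profile_signature`, its parameters, and their default values are assumptions made for this example.

```python
import hashlib
import re
from collections import Counter

def profile_signature(text, min_token_len=2, quant_rate=0.01):
    # Lowercase, split on anything that is not a letter or digit,
    # and ignore very short tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantisation step: frequencies are rounded down to a multiple of
    # `quant`, and tokens whose rounded frequency is zero are dropped.
    max_freq = max(counts.values())
    quant = max(2, round(max_freq * quant_rate))
    profile = {}
    for tok, freq in counts.items():
        q = (freq // quant) * quant
        if q >= quant:
            profile[tok] = q

    # Canonical ordering so that equal profiles give equal hashes.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    canonical = "\n".join(f"{tok} {freq}" for tok, freq in ordered)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

base = "nutch crawler " * 10 + "fetched on monday"
variant = "nutch crawler " * 10 + "fetched on tuesday"
other = "lucene index " * 10

print(profile_signature(base) == profile_signature(variant))  # True
print(profile_signature(base) == profile_signature(other))    # False
```

Words that occur only once fall below the quantisation threshold and drop out of the profile, so a small edit such as a changed date leaves the signature unchanged, while a page dominated by different terms gets a different one.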


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
