RE: pages with duplicate content in search results

Edward Quick Thu, 25 Sep 2008 14:45:58 -0700


> 
> Dennis Kubes wrote:
> > If you are using more than one index then dedup will not work across 
> > indexes.
> 
> This is incorrect. DeleteDuplicates works just fine with multiple 
> indexes, assuming you process all indexes in the same run of 
> DeleteDuplicates, so that it has a global view of all input indexes.
> 
> > A single index should dedup correctly unless the pages are not
> > exact duplicates but near duplicates.  The dedup process works on url 
> > and byte hash.  If the content is even 1 byte different, it doesn't work.
> 
> This depends on the implementation of Signature. Indeed, the default 
> MD5HashSignature works this way.
> 
> > 
> > Near duplicate detection is another set of algorithms that haven't been 
> > implemented in Nutch yet.
> 
> Well, the existing TextProfileSignature can be used as a form of (crude) 
> near-duplicate detection, precisely because it is tolerant to small 
> changes in the input text.


Thanks, Andrzej.
How do you tell Nutch to use TextProfileSignature instead of 
MD5HashSignature for deduplication?
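
To illustrate the distinction discussed above, here is a minimal Python
sketch. It is not Nutch code and does not reproduce the actual
TextProfileSignature algorithm: the function names, the minimum word
length, and the quantization step are all assumptions chosen for
illustration. It only shows why an exact byte hash treats a one-byte
change as a different page, while a signature built from a coarse
word-frequency profile can tolerate small edits.

```python
import hashlib
import re
from collections import Counter


def exact_signature(text: str) -> str:
    """MD5 over the raw bytes: any single-byte change gives a new digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def profile_signature(text: str, min_len: int = 3, quant: int = 2) -> str:
    """Hypothetical tolerant signature: hash a quantized word-frequency profile.

    Illustrative only. min_len and quant are assumed parameters, not
    settings taken from Nutch.
    """
    # Lowercase, split on non-alphanumerics, and drop very short tokens.
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= min_len]
    counts = Counter(words)
    # Round each count down to a multiple of quant; rare words drop out,
    # so small additions or punctuation changes do not alter the profile.
    profile = sorted((w, (c // quant) * quant) for w, c in counts.items() if c >= quant)
    joined = "\n".join(f"{w} {c}" for w, c in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


page_a = "Nutch crawls pages. Nutch indexes pages. Nutch dedups pages."
page_b = "Nutch crawls pages. Nutch indexes pages. Nutch dedups pages!! Updated"

# The exact hash sees two different documents.
print(exact_signature(page_a) == exact_signature(page_b))
# The profile-based signature sees the same document.
print(profile_signature(page_a) == profile_signature(page_b))
```

In this sketch the two pages differ by a few trailing characters, so the
exact signatures differ, while the profile signatures match because the
words that occur only once are discarded before hashing.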

> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 

