Re: pages with duplicate content in search results

Andrzej Bialecki Thu, 25 Sep 2008 14:55:36 -0700

Edward Quick wrote:

Dennis Kubes wrote:
If you are using more than one index then dedup will not work across indexes.

This is incorrect. DeleteDuplicates works just fine with multiple indexes, assuming you process all indexes in the same run of DeleteDuplicates, so that it has a global view of all input indexes.

A single index should dedup correctly unless the pages are not exact duplicates but near duplicates. The dedup process works on url and byte hash. If the content is even 1 byte different, it doesn't work.

This depends on the implementation of Signature. Indeed, the default MD5HashSignature works this way.

Near duplicate detection is another set of algorithms that haven't been implemented in Nutch yet.

Well, the existing TextProfileSignature can be used as a form of (crude) near-duplicate detection, precisely because it is tolerant to small changes in the input text.
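
To illustrate the difference discussed above, here is a rough Python sketch that contrasts a byte-exact hash with a tolerant, profile-style signature. It is not Nutch's code: the function names, the tokenizer and the quantization defaults are assumptions chosen for illustration, loosely modelled on the idea of hashing a quantized word-frequency profile rather than the raw bytes.

```python
import hashlib
import re
from collections import Counter


def exact_signature(content: bytes) -> str:
    """MD5 over the raw bytes: a one-byte change yields a different signature."""
    return hashlib.md5(content).hexdigest()


def profile_signature(text: str, min_token_len: int = 2,
                      quant_rate: float = 0.01) -> str:
    """Hypothetical, simplified profile signature (an assumption, not Nutch's code)."""
    # Lowercase, split on non-alphanumerics, drop very short tokens.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    max_freq = max(counts.values(), default=0)

    # Quantization step derived from the most frequent token.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    # Round each frequency down to a multiple of quant and drop rare tokens,
    # so small edits to the text usually leave the profile unchanged.
    quantized = {tok: (freq // quant) * quant for tok, freq in counts.items()}
    kept = sorted(((tok, f) for tok, f in quantized.items() if f >= quant),
                  key=lambda pair: (-pair[1], pair[0]))

    profile_text = "\n".join(f"{tok} {f}" for tok, f in kept)
    return hashlib.md5(profile_text.encode("utf-8")).hexdigest()


page_a = "nutch crawls the web and nutch indexes the web"
page_b = page_a + " today"  # near duplicate: one extra rare word

print(exact_signature(page_a.encode()) == exact_signature(page_b.encode()))
print(profile_signature(page_a) == profile_signature(page_b))
```

With these two sample strings the byte hashes differ while the profile signatures match, which is the kind of tolerance that makes a profile-based signature usable as crude near-duplicate detection.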
Thanks Andrzej.
How do you tell Nutch to use the TextProfileSignature instead of 
MD5HashSignature for deduplicating?


See the following property in your nutch-site.xml: db.signature.class.
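
As a sketch, the override in nutch-site.xml might look like the fragment below. The property name comes from the message above; the fully qualified class name assumes the signature classes live in the org.apache.nutch.crawl package, so check it against the nutch-default.xml shipped with your version.

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml. -->
<!-- The package name below is an assumption; verify it for your Nutch release. -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
```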


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
