You may want to read about Google's approach to near-duplicate detection: http://www2007.org/papers/paper215.pdf
The idea is to reduce an entire document to a 64-bit sketch via dimension reduction, and then compare the sketches of two documents to find near-duplicates. The key property of the sketch is that similar documents produce similar sketches. So, if the sketches of two documents differ in fewer than k bits, the documents are near-duplicates. In their experiments, they found k=3 yields the best results.

--shashi

On Sat, Jul 18, 2009 at 12:56 AM, Jason Rutherglen<[email protected]> wrote:
> I think this comes up fairly often in search apps, duplicate
> documents are indexed (for example using SimplyHired's search
> there are 20 of the same job listed from different websites). A
> similarity score above a threshold would determine the documents
> are too similar, are duplicates, and therefore can be removed.
> Is there a recommended Mahout algorithm for this?
>
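For illustration, here is a minimal sketch of the simhash-style fingerprinting the paper describes. This is an assumption-laden toy, not the paper's exact scheme: tokens are plain whitespace-split words with uniform weights, and MD5 stands in for whatever hash function a real implementation would use.

```python
import hashlib

BITS = 64

def simhash(tokens, bits=BITS):
    # Each token hashes to a bits-wide value; bit i of that hash
    # votes +1 (if set) or -1 (if clear) on bit i of the sketch.
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # The sketch keeps a 1 wherever the votes came out positive.
    sketch = 0
    for i in range(bits):
        if votes[i] > 0:
            sketch |= 1 << i
    return sketch

def hamming(a, b):
    # Number of bit positions in which the two sketches differ.
    return bin(a ^ b).count("1")

doc1 = "senior java developer position in new york".split()
doc2 = "senior java developer role in new york".split()
doc3 = "chocolate chip cookie recipe with walnuts".split()

k = 3  # the threshold the paper found to work best for 64-bit sketches
print("near-duplicate pair distance:", hamming(simhash(doc1), simhash(doc2)))
print("unrelated pair distance:     ", hamming(simhash(doc1), simhash(doc3)))
```

Documents whose sketches are within Hamming distance k would be flagged as near-duplicates; in practice the paper's contribution is also how to find all sketch pairs within distance k efficiently over billions of documents, which this toy does not attempt.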
