How is the hash calculated?
On Mon, Jul 20, 2009 at 1:41 AM, Shashikant Kore<[email protected]> wrote:
> You may read about Google's approach for near-duplicates.
>
> http://www2007.org/papers/paper215.pdf
>
> The idea here is to reduce the entire document to a 64-bit sketch by
> dimension reduction and then compare the sketches of two documents to
> find near-duplicates. The key property of the sketch is that similar
> documents produce similar sketches. So, if the sketches for two
> documents differ in fewer than k bits, they are near-duplicates. In
> their experiment, they found k=3 yields the best results.
>
> --shashi
>
> On Sat, Jul 18, 2009 at 12:56 AM, Jason
> Rutherglen<[email protected]> wrote:
>> I think this comes up fairly often in search apps, duplicate
>> documents are indexed (for example using SimplyHired's search
>> there are 20 of the same job listed from different websites). A
>> similarity score above a threshold would determine the documents
>> are too similar, are duplicates, and therefore can be removed.
>> Is there a recommended Mahout algorithm for this?
>>
>
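
For illustration, here is a minimal sketch (not Mahout code, and not the paper's
exact feature-weighting scheme) of how such a 64-bit simhash could be computed:
hash each token to 64 bits, keep a signed counter per bit position weighted by
term frequency, and set the output bit to 1 wherever the counter ends up
positive. The whitespace tokenization, the FNV-style hash64 helper, and the
term-frequency weights below are assumptions made just for this example.

import java.util.HashMap;
import java.util.Map;

public class SimHash {

  // Compute a 64-bit simhash sketch of a document's text.
  public static long simhash(String text) {
    int[] counts = new int[64];

    // Crude tokenization and term-frequency counting (illustrative only).
    Map<String, Integer> termFreqs = new HashMap<>();
    for (String token : text.toLowerCase().split("\\W+")) {
      if (!token.isEmpty()) {
        termFreqs.merge(token, 1, Integer::sum);
      }
    }

    // For each term, add its weight to counters where the term's hash bit
    // is 1 and subtract where it is 0.
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      long h = hash64(e.getKey());
      int weight = e.getValue();
      for (int bit = 0; bit < 64; bit++) {
        if (((h >>> bit) & 1L) == 1L) {
          counts[bit] += weight;
        } else {
          counts[bit] -= weight;
        }
      }
    }

    // Collapse the counters back into a single 64-bit sketch.
    long sketch = 0L;
    for (int bit = 0; bit < 64; bit++) {
      if (counts[bit] > 0) {
        sketch |= (1L << bit);
      }
    }
    return sketch;
  }

  // Simple 64-bit FNV-1a-style hash; the paper does not mandate a
  // particular hash function, so this choice is an assumption.
  private static long hash64(String s) {
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < s.length(); i++) {
      h ^= s.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }

  // Number of differing bits between two sketches.
  public static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
  }
}

Two documents would then be treated as near-duplicates when
hammingDistance(simhash(docA), simhash(docB)) <= 3, per the k=3 figure
quoted above.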
