How is the hash calculated?
On Mon, Jul 20, 2009 at 1:41 AM, Shashikant Kore<[email protected]> wrote:
> You may read about Google's approach for near-duplicates.
>
> http://www2007.org/papers/paper215.pdf
>
> The idea here is to reduce the entire document to a 64-bit sketch by
> dimension reduction and then compare the sketches of two documents to
> find near-duplicates. The key property of the sketch is that similar
> documents produce similar sketches. So, if the sketches for two
> documents differ in fewer than k bits, they are near-duplicates. In
> their experiment, they found k=3 yields the best results.
>
> --shashi
>
> On Sat, Jul 18, 2009 at 12:56 AM, Jason
> Rutherglen<[email protected]> wrote:
>> I think this comes up fairly often in search apps, duplicate
>> documents are indexed (for example using SimplyHired's search
>> there are 20 of the same job listed from different websites). A
>> similarity score above a threshold would determine the documents
>> are too similar, are duplicates, and therefore can be removed.
>> Is there a recommended Mahout algorithm for this?
>>
>
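
For illustration, here is a minimal sketch (not Mahout code, and not the paper's
exact feature-weighting scheme) of how such a 64-bit simhash could be computed:
hash each token to 64 bits, keep a signed counter per bit position weighted by
term frequency, and set the output bit to 1 wherever the counter ends up
positive. The whitespace tokenization, the FNV-style hash64 helper, and the
term-frequency weights below are assumptions made just for this example.

import java.util.HashMap;
import java.util.Map;

public class SimHash {

  // Compute a 64-bit simhash sketch of a document's text.
  public static long simhash(String text) {
    int[] counts = new int[64];

    // Crude tokenization and term-frequency counting (illustrative only).
    Map<String, Integer> termFreqs = new HashMap<>();
    for (String token : text.toLowerCase().split("\\W+")) {
      if (!token.isEmpty()) {
        termFreqs.merge(token, 1, Integer::sum);
      }
    }

    // For each term, add its weight to counters where the term's hash bit
    // is 1 and subtract where it is 0.
    for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
      long h = hash64(e.getKey());
      int weight = e.getValue();
      for (int bit = 0; bit < 64; bit++) {
        if (((h >>> bit) & 1L) == 1L) {
          counts[bit] += weight;
        } else {
          counts[bit] -= weight;
        }
      }
    }

    // Collapse the counters back into a single 64-bit sketch.
    long sketch = 0L;
    for (int bit = 0; bit < 64; bit++) {
      if (counts[bit] > 0) {
        sketch |= (1L << bit);
      }
    }
    return sketch;
  }

  // Simple 64-bit FNV-1a-style hash; the paper does not mandate a
  // particular hash function, so this choice is an assumption.
  private static long hash64(String s) {
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < s.length(); i++) {
      h ^= s.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }

  // Number of differing bits between two sketches.
  public static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
  }
}

Two documents would then be treated as near-duplicates when
hammingDistance(simhash(docA), simhash(docB)) <= 3, per the k=3 figure
quoted above.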
