You may want to read about Google's approach to near-duplicate detection: http://www2007.org/papers/paper215.pdf
The idea is to reduce an entire document to a 64-bit sketch via dimension reduction, and then compare the sketches of two documents to find near-duplicates. The key property of the sketch is that similar documents produce similar sketches. So, if the sketches of two documents differ in fewer than k bits, the documents are near-duplicates. In their experiments, they found k=3 yields the best results.

--shashi

On Sat, Jul 18, 2009 at 12:56 AM, Jason Rutherglen<[email protected]> wrote:
> I think this comes up fairly often in search apps, duplicate
> documents are indexed (for example using SimplyHired's search
> there are 20 of the same job listed from different websites). A
> similarity score above a threshold would determine the documents
> are too similar, are duplicates, and therefore can be removed.
> Is there a recommended Mahout algorithm for this?
>
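For illustration, here is a minimal sketch of the simhash-style fingerprinting the paper describes. This is an assumption-laden toy, not the paper's exact scheme: tokens are plain whitespace-split words with uniform weights, and MD5 stands in for whatever hash function a real implementation would use.

```python
import hashlib

BITS = 64

def simhash(tokens, bits=BITS):
    # Each token hashes to a bits-wide value; bit i of that hash
    # votes +1 (if set) or -1 (if clear) on bit i of the sketch.
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # The sketch keeps a 1 wherever the votes came out positive.
    sketch = 0
    for i in range(bits):
        if votes[i] > 0:
            sketch |= 1 << i
    return sketch

def hamming(a, b):
    # Number of bit positions in which the two sketches differ.
    return bin(a ^ b).count("1")

doc1 = "senior java developer position in new york".split()
doc2 = "senior java developer role in new york".split()
doc3 = "chocolate chip cookie recipe with walnuts".split()

k = 3  # the threshold the paper found to work best for 64-bit sketches
print("near-duplicate pair distance:", hamming(simhash(doc1), simhash(doc2)))
print("unrelated pair distance:     ", hamming(simhash(doc1), simhash(doc3)))
```

Documents whose sketches are within Hamming distance k would be flagged as near-duplicates; in practice the paper's contribution is also how to find all sketch pairs within distance k efficiently over billions of documents, which this toy does not attempt.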
