It isn't a normal hash. The article describes the computation of an f-bit
"simhash", computed by

a) extracting features from a document, including (mostly) terms

b) computing weights for the terms, most likely by some variant of IDF
weighting

c) computing an f-bit conventional hash for each feature

d) summing these hashes componentwise, with each bit represented as a
signed floating point value: a 0 in a feature's hash contributes a negative
value, a 1 contributes a positive value, and the absolute magnitude of the
contribution is the feature's weight

e) reducing the summed components by taking just the sign bit of each
floating point component of the sum

As the article says, this simhash has interesting properties. Small changes
to the original document result in small changes to the simhash, but the
simhashes of distinct documents have the usual properties of a hash.
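To make this concrete, here is a minimal Python sketch of steps c) through
e). The 64-bit width, the truncated-MD5 per-feature hash, and the
weighted_features argument (a precomputed {feature: weight} mapping from
steps a) and b)) are assumptions I've made for the example, not details
from the paper:

import hashlib

F = 64  # fingerprint width in bits (an assumption for this sketch)

def feature_hash(feature):
    """Step c: a conventional f-bit hash of one feature (truncated MD5 here)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:F // 8], "big")

def simhash(weighted_features):
    """Steps d and e: weighted_features is a {feature: weight} dict,
    e.g. terms with IDF weights from steps a and b."""
    v = [0.0] * F  # one signed accumulator per bit position
    for feature, weight in weighted_features.items():
        h = feature_hash(feature)
        for i in range(F):
            # a 1 bit contributes +weight, a 0 bit contributes -weight
            v[i] += weight if (h >> i) & 1 else -weight
    # keep just the sign of each component
    out = 0
    for i in range(F):
        if v[i] > 0:
            out |= 1 << i
    return out

Near-duplicate detection then reduces to a Hamming-distance check between
fingerprints, e.g. bin(simhash(d1) ^ simhash(d2)).count("1") <= k for some
small threshold k.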
This is a very nice technique. It appears to be much better than the
shingle-based algorithms.

On Mon, Jul 20, 2009 at 6:20 PM, Jason Rutherglen
<[email protected]> wrote:
> How is the hash calculated?
>
> On Mon, Jul 20, 2009 at 1:41 AM, Shashikant Kore <[email protected]>
> wrote:
> > You may read about Google's approach for near-duplicates.
> >
> > http://www2007.org/papers/paper215.pdf

--
Ted Dunning, CTO DeepDyve
