It isn't a normal hash. The article describes the computation of an f-bit
"simhash", computed by

a) extracting features from a document, including (mostly) terms

b) computing weights for the terms, most likely by some variant of IDF
weighting

c) computing an f-bit conventional hash for each feature

d) summing these hashes componentwise, with each bit represented as a
signed floating point value: a 0 in a feature's hash contributes a negative
value, a 1 contributes a positive value, and the absolute magnitude of the
contribution is the feature's weight

e) reducing the summed components by taking just the sign bit of each
floating point component of the sum

As the article says, this simhash has interesting properties. Small changes
to the original document result in small changes to the simhash, but the
simhashes of distinct documents have the usual properties of a hash.
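To make this concrete, here is a minimal Python sketch of steps c) through
e). The 64-bit width, the truncated-MD5 per-feature hash, and the
weighted_features argument (a precomputed {feature: weight} mapping from
steps a) and b)) are assumptions I've made for the example, not details
from the paper:

import hashlib

F = 64  # fingerprint width in bits (an assumption for this sketch)

def feature_hash(feature):
    """Step c: a conventional f-bit hash of one feature (truncated MD5 here)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:F // 8], "big")

def simhash(weighted_features):
    """Steps d and e: weighted_features is a {feature: weight} dict,
    e.g. terms with IDF weights from steps a and b."""
    v = [0.0] * F  # one signed accumulator per bit position
    for feature, weight in weighted_features.items():
        h = feature_hash(feature)
        for i in range(F):
            # a 1 bit contributes +weight, a 0 bit contributes -weight
            v[i] += weight if (h >> i) & 1 else -weight
    # keep just the sign of each component
    out = 0
    for i in range(F):
        if v[i] > 0:
            out |= 1 << i
    return out

Near-duplicate detection then reduces to a Hamming-distance check between
fingerprints, e.g. bin(simhash(d1) ^ simhash(d2)).count("1") <= k for some
small threshold k.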
This is a very nice technique. It appears to be much better than the
shingle-based algorithms.

On Mon, Jul 20, 2009 at 6:20 PM, Jason Rutherglen
<[email protected]> wrote:
> How is the hash calculated?
>
> On Mon, Jul 20, 2009 at 1:41 AM, Shashikant Kore <[email protected]>
> wrote:
> > You may read about Google's approach for near-duplicates.
> >
> > http://www2007.org/papers/paper215.pdf

--
Ted Dunning, CTO DeepDyve
