On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:

> similarity-preserving hash function was calculated on each sentence, and the 
> hash was added as a field. The property of the hash was that similar 
> documents (sentences) would produce a similar hash, with only some bit-level 
> perturbation. The challenge was to find a ranked list of possible duplicates 
> with similar (not exact same) hashes, which in this case meant to find a 
> ranked list of documents that have the smallest bit-level distance in their 
> hashes from the query hash.
> 
> The solution is described in SOLR-1918 - Bit-wise scoring field type.

In other words, a simhash, no?

Similarity Estimation Techniques from Rounding Algorithms
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

http://www.matpalm.com/resemblance/simhash/
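
To make the idea concrete, here is a minimal sketch of a Charikar-style simhash plus ranking by bit-level (Hamming) distance, as described in the quoted text. It is not the SOLR-1918 implementation. The function names (simhash, hamming, rank_by_hamming), the whitespace tokenization, the unweighted tokens, the 64-bit width, and the use of MD5 as a stable per-token hash are all assumptions made for illustration.

```python
import hashlib


def simhash(text, bits=64):
    """Illustrative simhash over whitespace tokens (assumes bits <= 64)."""
    counts = [0] * bits
    for token in text.lower().split():
        # Stable per-token hash. MD5 is chosen only for determinism,
        # not for security; any well-mixed hash would do.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query, docs, bits=64):
    """Return (distance, doc) pairs, smallest bit-level distance first."""
    q = simhash(query, bits)
    scored = [(hamming(q, simhash(d, bits)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0])
```

This brute-force ranking scans every document, which is fine for a sketch but would not scale to a large index; doing the bit-wise scoring inside the search engine is what the field type in SOLR-1918 is meant to address.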


