On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:

> similarity-preserving hash function was calculated on each sentence, and the
> hash was added as a field. The property of the hash was that similar
> documents (sentences) would produce a similar hash, with only some bit-level
> perturbation. The challenge was to find a ranked list of possible duplicates
> with similar (not exact same) hashes, which in this case meant to find a
> ranked list of documents that have the smallest bit-level distance in their
> hashes from the query hash.
>
> The solution is described in SOLR-1918 - Bit-wise scoring field type.

In other words, a simhash, no?

Similarity Estimation Techniques from Rounding Algorithms
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

http://www.matpalm.com/resemblance/simhash/
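
To make the comparison concrete, here is a rough sketch of what I mean by a simhash, plus the "rank by smallest bit-level distance" step from the quoted message. This is only an illustration under my own assumptions: whitespace tokens as features, unit weights, and the first 8 bytes of MD5 as the per-token hash. The names simhash, hamming and rank_by_distance are made up for this example, and none of this is the SOLR-1918 code.

```python
import hashlib

BITS = 64  # fingerprint width; assumed here, not taken from SOLR-1918


def simhash(text):
    """Toy Charikar-style simhash over whitespace tokens with unit weights."""
    counts = [0] * BITS
    for token in text.lower().split():
        # Stable 64-bit hash per token (first 8 bytes of MD5, an arbitrary choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def rank_by_distance(query_hash, doc_hashes):
    """Return (doc_id, distance) pairs, closest hash first.

    doc_hashes maps a document id to its stored hash. Ties are broken
    by document id so the ordering is deterministic.
    """
    pairs = [(doc_id, hamming(query_hash, h)) for doc_id, h in doc_hashes.items()]
    return sorted(pairs, key=lambda p: (p[1], p[0]))
```

Note that this ranks by a linear scan over every stored hash, which is fine for showing the idea but is not how you would want to do it inside an index.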