On 31/10/2011 21:42, Petite Abeille wrote:

On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:

similarity-preserving hash function was calculated on each sentence, and the 
hash was added as a field. The property of the hash was that similar documents 
(sentences) would produce a similar hash, with only some bit-level 
perturbation. The challenge was to find a ranked list of possible duplicates 
with similar (not identical) hashes, which in this case meant finding a ranked 
list of documents whose hashes have the smallest bit-level distance 
from the query hash.

The solution is described in SOLR-1918 - Bit-wise scoring field type.
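
For illustration only, the ranking idea described above (order documents by how few bits their hash differs from the query hash) might be sketched as below. This is not the SOLR-1918 code: the class name HammingRanker, the 64-bit hash width, the maxDistance cutoff, and the brute-force scan over an in-memory array are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not the SOLR-1918 implementation.
final class HammingRanker {

    /** Number of bit positions in which two 64-bit hashes differ. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /**
     * Returns the indices of docHashes ordered by ascending bit distance
     * from queryHash, keeping only hashes within maxDistance bits.
     * Ties keep their original index order (List.sort is stable).
     */
    static List<Integer> rank(long queryHash, long[] docHashes, int maxDistance) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docHashes.length; i++) {
            if (hammingDistance(queryHash, docHashes[i]) <= maxDistance) {
                hits.add(i);
            }
        }
        hits.sort(Comparator.comparingInt(i -> hammingDistance(queryHash, docHashes[i])));
        return hits;
    }

    public static void main(String[] args) {
        long query = 0b1011_0110L;
        long[] docs = { 0b1011_0110L, 0b0100_1001L, 0b1011_0100L, 0b1111_0111L };
        // Distances from the query are 0, 8, 1 and 2 bits; document 1 is cut off.
        System.out.println(rank(query, docs, 3)); // prints [0, 2, 3]
    }
}
```

A linear scan like this only shows the scoring; at index scale the distance would presumably be computed inside the search engine's scoring path, which is what the issue above is about.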

In other words, a simhash, no?

Similarity Estimation Techniques from Rounding Algorithms
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

http://www.matpalm.com/resemblance/simhash/

Yes, you could use this. In that project we used a different application-specific hash.
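
For readers unfamiliar with the term, a minimal sketch of a simhash in the style of the Charikar paper linked above could look like the following. It is not the application-specific hash used in that project. The class name SimHashSketch, the choice of 64-bit FNV-1a as the per-token hash, and the unweighted token votes are all assumptions for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a simhash; not the hash used in the project.
final class SimHashSketch {

    /** 64-bit FNV-1a hash of a token (an arbitrary choice for this sketch). */
    static long tokenHash(String token) {
        long h = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;               // FNV prime
        }
        return h;
    }

    /**
     * Each token votes +1 or -1 on every bit position according to its own
     * hash; a result bit is set where the positive votes win.
     */
    static long simhash(String[] tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = tokenHash(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                result |= 1L << bit;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long a = simhash("the quick brown fox jumps".split(" "));
        long b = simhash("the quick brown fox leaps".split(" "));
        // Sentences sharing most tokens tend to differ in relatively few bits.
        System.out.println(Long.bitCount(a ^ b));
    }
}
```

Because every token contributes one vote per bit, changing a single token can only flip bits where the vote was close, which is why near-duplicate sentences tend to end up a small bit distance apart.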


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

