I have an idea for a filter using a technique I used on another project, and I 
thought I'd share it because it might be useful to someone.

Finding exact matches is easy, but finding documents with small differences 
isn't. Google uses a technique that is very easy to implement and costs only 
64 bits per document (if you choose that size). It is called simhash and is 
not to be confused with minhash (which Twitter uses).

In simple words, simhash works like this (a sketch in Java follows the list):
1. Hash every token with a classic hash function (like MurmurHash).
2. For each bit position, sum across all token hashes: a 1 bit counts as +1 
and a 0 bit as -1.
3. You now have 64 sums, and for every positive sum the corresponding simhash 
bit is 1; for every zero or negative sum the simhash bit is 0.
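
Here is a minimal sketch of those three steps. The hash64 helper is a 
hypothetical stand-in I added just to keep the example self-contained; in 
practice you would plug in a real MurmurHash implementation (e.g. Guava's 
Hashing.murmur3_128()).

    import java.util.List;

    public final class SimHash {

        // Steps 1-3 from the list above, as one function.
        static long simhash(List<String> tokens) {
            int[] sums = new int[64];                   // one counter per bit position
            for (String token : tokens) {
                long h = hash64(token);                 // step 1: hash every token
                for (int i = 0; i < 64; i++) {
                    // step 2: bit set -> +1, bit clear -> -1
                    sums[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
                }
            }
            long fingerprint = 0L;
            for (int i = 0; i < 64; i++) {
                if (sums[i] > 0) {                      // step 3: positive sum -> bit 1
                    fingerprint |= 1L << i;
                }
            }
            return fingerprint;
        }

        // Placeholder 64-bit string hash (a simple polynomial fold),
        // only here so the sketch compiles on its own.
        static long hash64(String token) {
            long h = 1125899906842597L;
            for (int i = 0; i < token.length(); i++) {
                h = 31 * h + token.charAt(i);
            }
            return h;
        }
    }

With this, simhash(tokens) gives you one 64-bit fingerprint you can store 
next to each document.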

That's it. Now if you are searching for near-duplicate documents, you only 
have to compare the simhashes: if they differ in only a small number of bits 
(the Google paper uses a Hamming distance of 3 out of 64), the documents are 
near duplicates.
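
Continuing the hypothetical SimHash class above, that comparison is just an 
XOR plus a popcount:

    // Near-duplicate test: XOR the fingerprints and count differing bits.
    // The threshold of 3 for 64-bit fingerprints is the one from the
    // Google paper.
    static boolean isNearDuplicate(long a, long b) {
        return Long.bitCount(a ^ b) <= 3;
    }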

And there is more you could do, like giving stop words only a small weight 
when summing, or giving synonyms the same (Murmur) hash with a smaller 
weight. A weighted variant is sketched below.
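
A sketch of that weighted variant, again continuing the hypothetical class 
above: each token contributes +/-weight instead of +/-1. The weights map is a 
made-up lookup for illustration; tokens not in it default to weight 1, and 
synonyms could be mapped to one canonical token before hashing so they 
produce the same bits.

    import java.util.Map;

    // Weighted simhash: down-weight stop words by giving them a small
    // weight, e.g. weights.put("the", 0). Reuses hash64 from above.
    static long weightedSimhash(List<String> tokens, Map<String, Integer> weights) {
        int[] sums = new int[64];
        for (String token : tokens) {
            int w = weights.getOrDefault(token, 1);
            long h = hash64(token);
            for (int i = 0; i < 64; i++) {
                sums[i] += ((h >>> i) & 1L) == 1L ? w : -w;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < 64; i++) {
            if (sums[i] > 0) fingerprint |= 1L << i;
        }
        return fingerprint;
    }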

Links:
http://www.wwwconference.org/www2007/papers/paper215.pdf (the Google paper)
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf (Charikar's paper introducing simhash)
http://www.matpalm.com/resemblance/simhash/
https://github.com/twitter/algebird (Twitter's library, which includes minhash)
