Hello Valentin ,
Thanks for this suggestion.
We might find this useful in some of our projects.
Thanks
Vineeth
On Tue, Jul 29, 2014 at 2:42 AM, Valentin <[email protected]> wrote:
> I have an idea for a filter with a technique I used on another project. I
> thought I should share because this might be useful to someone.
>
> Finding exact matches is an easy task but finding documents with small
> differences isnt. Google is using a technique which is very easy to
> implement and only costs 64 bit (if you choose this size). It is called
> simhash and is not to be confused with minhash (which Twitter is using).
>
> In simple words simhash works like this:
> 1. you hash every token with a classic hash (like Murmur)
> 2. you sum up every first bit, second bit and so on (1 = +1 and 0 = -1)
> 3. now you have 64 numbers and for every positive number the corresponding
> simhash bit is 1 and for every zero and negative number the simhash bit is 0
>
> Thats it. Now if you searching for a near duplicate document you only have
> to compare the simhashes and if it only differs in a small number of bits
> (the Google paper says a hamming distance of 3 in 64) its a near duplicate.
>
> And there is more. There are things you could do, like giving stop words
> only a small weight when adding up. Giving synonyms the same (murmur) hash
> with a smaller weight
>
> Links:
> http://www.wwwconference.org/www2007/papers/paper215.pdf
>
> http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
> http://www.matpalm.com/resemblance/simhash/
>
> https://github.com/twitter/algebird
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/31fe6560-d066-481d-8b7a-7f51cab735a2%40googlegroups.com
> .
> For more options, visit https://groups.google.com/d/optout.
>
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAGdPd5%3Doqym%3D2y24rijgWekk2z5%2BZCYX%2BKg3UcCfet3-Lh%2BRjg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.