I've found an article, http://www.xcombinator.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/, that describes an implementation of simhash in MapReduce; the implementation is licensed under GPL v3.
Also, for short messages Twitter uses MinHashing and 4-byte signatures before inserting into Lucene: http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html

On Wed, Jun 8, 2011 at 11:59 AM, Pere Ferrera <[email protected]> wrote:

> Hi guys,
>
> Looking back at some code I wrote in the past, I was wondering whether this
> piece would be a good fit for the Mahout project.
>
> I implemented in Map/Reduce the idea of Google's paper "Detecting
> near-duplicates for web crawling"
> <http://www.google.es/url?sa=t&source=web&cd=1&ved=0CBwQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.78.7794%26rep%3Drep1%26type%3Dpdf&rct=j&q=detecting%20near-duplicates%20for%20web%20crawling&ei=DqfvTZykFpGwhAeAusSRCQ&usg=AFQjCNEeQnftMUXrnUwX3nJcN5hlt6tyjQ>.
> Basically, I compute a simhash for each document in the mapper and
> generate some permutations of it. Reducers compare, in memory, the simhashes
> belonging to the same permutation using Hamming distance.
>
> This idea has some key features:
> - It can be fully distributed, since you can partition by permutation ID +
> simhash prefix. The more reducers you use, the faster everything is computed.
> - It is very efficient, since the documents themselves are not shuffled;
> only simhashes are sent to the reduce phase.
>
> However, its use is limited to huge datasets with modest-sized documents
> (it is not a good fit for short strings, for instance).
>
> I searched and found this JIRA:
> https://issues.apache.org/jira/browse/MAHOUT-365 and some conversations
> (http://mail-archives.apache.org/mod_mbox/mahout-dev/201003.mbox/%[email protected]%3E).
> However, it seems nothing is on the way?
>
> I used it in the past for an experiment detecting duplicated web pages in
> Hadoop. I would need to do further proper testing with big data sets to make
> it publicly available. So I would appreciate your feedback on this, and if
> you think it could be a good contribution, just tell me what steps to follow.
>
> Thanks!
>
> Pere.
>
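For reference, here is a minimal, self-contained sketch of the simhash + permutation idea described above: a 64-bit fingerprint built from token hashes, Hamming distance as the near-duplicate test, and a (permutation ID, fingerprint prefix) key so that candidates meet in the same reducer group. The token hashing, weights, and rotation-based "permutation" are simplifications for illustration only, not the paper's exact construction or any Mahout API.

    import java.util.HashMap;
    import java.util.Map;

    public class SimhashSketch {

        private static final int BITS = 64;

        /** 64-bit simhash over whitespace-separated tokens, each with weight 1. */
        public static long simhash(String text) {
            int[] votes = new int[BITS];
            for (String token : text.toLowerCase().split("\\s+")) {
                long h = hash64(token);
                for (int i = 0; i < BITS; i++) {
                    votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
                }
            }
            long fingerprint = 0L;
            for (int i = 0; i < BITS; i++) {
                if (votes[i] > 0) {
                    fingerprint |= 1L << i;
                }
            }
            return fingerprint;
        }

        /** Simple 64-bit FNV-1a string hash; a real implementation would use a stronger hash. */
        private static long hash64(String s) {
            long h = 0xcbf29ce484222325L;
            for (int i = 0; i < s.length(); i++) {
                h ^= s.charAt(i);
                h *= 0x100000001b3L;
            }
            return h;
        }

        /** Number of differing bits; near-duplicates typically have a small distance (e.g. <= 3). */
        public static int hammingDistance(long a, long b) {
            return Long.bitCount(a ^ b);
        }

        /**
         * Emulates the mapper's partition key: rotate the fingerprint (a cheap stand-in
         * for the paper's block permutations) and take the top prefixBits bits, so
         * documents sharing a prefix in some permutation land in the same reducer group.
         */
        public static String bucketKey(long fingerprint, int permutation, int prefixBits) {
            long permuted = Long.rotateLeft(fingerprint, permutation * 8);
            long prefix = permuted >>> (BITS - prefixBits);
            return permutation + ":" + prefix;
        }

        public static void main(String[] args) {
            long a = simhash("the quick brown fox jumps over the lazy dog");
            long b = simhash("the quick brown fox jumped over the lazy dog");
            System.out.println("Hamming distance: " + hammingDistance(a, b));

            // Group the two documents by (permutation, prefix); in MapReduce these keys
            // would be emitted by the mapper and grouped together by the shuffle.
            Map<String, Integer> buckets = new HashMap<>();
            for (int p = 0; p < 8; p++) {
                buckets.merge(bucketKey(a, p, 16), 1, Integer::sum);
                buckets.merge(bucketKey(b, p, 16), 1, Integer::sum);
            }
            buckets.forEach((k, count) -> {
                if (count > 1) {
                    System.out.println("candidate pair in bucket " + k);
                }
            });
        }
    }

Only the fingerprints and bucket keys would be shuffled in the real job, which is what keeps the approach cheap for large corpora, as Pere notes above.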
