I've found an article, http://www.xcombinator.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/, that describes an implementation of simhash in MapReduce; the implementation is licensed under GPL v3.
Also, for short messages Twitter uses MinHashing and 4-byte signatures before inserting into Lucene: http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html

On Wed, Jun 8, 2011 at 11:59 AM, Pere Ferrera <[email protected]> wrote:

> Hi guys,
>
> Looking back at some code I wrote in the past, I was wondering whether this
> piece would be a good fit for the Mahout project.
>
> I implemented in Map/Reduce the idea of Google's paper "Detecting
> near-duplicates for web crawling"
> <http://www.google.es/url?sa=t&source=web&cd=1&ved=0CBwQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.78.7794%26rep%3Drep1%26type%3Dpdf&rct=j&q=detecting%20near-duplicates%20for%20web%20crawling&ei=DqfvTZykFpGwhAeAusSRCQ&usg=AFQjCNEeQnftMUXrnUwX3nJcN5hlt6tyjQ>.
> Basically, I compute a simhash for each document in the mapper and
> generate some permutations of it. Reducers compare, in memory, the simhashes
> belonging to the same permutation using Hamming distance.
>
> This idea has some key features:
> - It can be fully distributed, since you can partition by permutation ID +
> simhash prefix. The more reducers you use, the faster everything is computed.
> - It is very efficient, since the documents themselves are not shuffled;
> only simhashes are sent to the reduce phase.
>
> However, its use is limited to huge datasets with modest-sized documents
> (it is not a good fit for short strings, for instance).
>
> I searched and found this JIRA:
> https://issues.apache.org/jira/browse/MAHOUT-365 and some conversations
> (http://mail-archives.apache.org/mod_mbox/mahout-dev/201003.mbox/%[email protected]%3E).
> However, it seems nothing is on the way?
>
> I used it in the past for an experiment detecting duplicated web pages in
> Hadoop. I would need to do further proper testing with big data sets to make
> it publicly available. So I would appreciate your feedback on this, and if
> you think it could be a good contribution, just tell me what steps to follow.
>
> Thanks!
>
> Pere.
>
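For reference, here is a minimal, self-contained sketch of the simhash + permutation idea described above: a 64-bit fingerprint built from token hashes, Hamming distance as the near-duplicate test, and a (permutation ID, fingerprint prefix) key so that candidates meet in the same reducer group. The token hashing, weights, and rotation-based "permutation" are simplifications for illustration only, not the paper's exact construction or any Mahout API.

    import java.util.HashMap;
    import java.util.Map;

    public class SimhashSketch {

        private static final int BITS = 64;

        /** 64-bit simhash over whitespace-separated tokens, each with weight 1. */
        public static long simhash(String text) {
            int[] votes = new int[BITS];
            for (String token : text.toLowerCase().split("\\s+")) {
                long h = hash64(token);
                for (int i = 0; i < BITS; i++) {
                    votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
                }
            }
            long fingerprint = 0L;
            for (int i = 0; i < BITS; i++) {
                if (votes[i] > 0) {
                    fingerprint |= 1L << i;
                }
            }
            return fingerprint;
        }

        /** Simple 64-bit FNV-1a string hash; a real implementation would use a stronger hash. */
        private static long hash64(String s) {
            long h = 0xcbf29ce484222325L;
            for (int i = 0; i < s.length(); i++) {
                h ^= s.charAt(i);
                h *= 0x100000001b3L;
            }
            return h;
        }

        /** Number of differing bits; near-duplicates typically have a small distance (e.g. <= 3). */
        public static int hammingDistance(long a, long b) {
            return Long.bitCount(a ^ b);
        }

        /**
         * Emulates the mapper's partition key: rotate the fingerprint (a cheap stand-in
         * for the paper's block permutations) and take the top prefixBits bits, so
         * documents sharing a prefix in some permutation land in the same reducer group.
         */
        public static String bucketKey(long fingerprint, int permutation, int prefixBits) {
            long permuted = Long.rotateLeft(fingerprint, permutation * 8);
            long prefix = permuted >>> (BITS - prefixBits);
            return permutation + ":" + prefix;
        }

        public static void main(String[] args) {
            long a = simhash("the quick brown fox jumps over the lazy dog");
            long b = simhash("the quick brown fox jumped over the lazy dog");
            System.out.println("Hamming distance: " + hammingDistance(a, b));

            // Group the two documents by (permutation, prefix); in MapReduce these keys
            // would be emitted by the mapper and grouped together by the shuffle.
            Map<String, Integer> buckets = new HashMap<>();
            for (int p = 0; p < 8; p++) {
                buckets.merge(bucketKey(a, p, 16), 1, Integer::sum);
                buckets.merge(bucketKey(b, p, 16), 1, Integer::sum);
            }
            buckets.forEach((k, count) -> {
                if (count > 1) {
                    System.out.println("candidate pair in bucket " + k);
                }
            });
        }
    }

Only the fingerprints and bucket keys would be shuffled in the real job, which is what keeps the approach cheap for large corpora, as Pere notes above.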
