On Thu, Jun 9, 2011 at 1:15 AM, Ted Dunning <[email protected]> wrote:
> I think that it is an interesting idea to have an implementation of this.
>

Good. I guess I'll file a JIRA and submit a patch with it then.

> (btw... Ivan said you would say hi if you had been at buzzwords with him)
>

Yes, it would have been interesting to go... but he won at "rock paper
scissors" :) Fortunately, Apache Eurocon is in Barcelona this year, so I
have no excuses for that one...

> On Wed, Jun 8, 2011 at 11:31 AM, Pere Ferrera <[email protected]> wrote:
> > (by the way, the idea is about duplicates and near-duplicates, not only
> > exact duplicates).
> >
> > On Wed, Jun 8, 2011 at 8:16 PM, Pere Ferrera <[email protected]> wrote:
> >
> >> The funny thing about Google's method is that it allows you to find
> >> simhashes that differ in at most "k" bits (not only join by equal
> >> simhashes). They found k = 3 to be a good parameter for 64-bit simhashes
> >> and a large corpus of crawled pages. I'm guessing this can be potentially
> >> more useful (detecting more duplicates), but I'll need to dig a bit more
> >> into it.
> >>
> >> On Wed, Jun 8, 2011 at 7:51 PM, Elmer Garduno <[email protected]> wrote:
> >>
> >>> I've found an article,
> >>> http://www.xcombinator.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/
> >>> that describes an implementation of simhash in MapReduce; the
> >>> implementation is licensed under GPL v3.
> >>>
> >>> Also, for short messages, Twitter uses MinHashing and 4-byte signatures
> >>> before inserting into Lucene:
> >>> http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html
> >>>
> >>> On Wed, Jun 8, 2011 at 11:59 AM, Pere Ferrera <[email protected]> wrote:
> >>>
> >>> > Hi guys,
> >>> >
> >>> > Looking back at some code I wrote in the past, I was wondering whether
> >>> > this piece would be a good fit for the Mahout project.
> >>> >
> >>> > I implemented in Map/Reduce the idea of the Google paper "Detecting
> >>> > near-duplicates for web crawling"
> >>> > <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.7794&rep=rep1&type=pdf>.
> >>> > Basically, I compute a simhash for each document in the mapper and
> >>> > generate some permutations of it. Reducers then compare, in memory, the
> >>> > simhashes belonging to the same permutation, using Hamming distance.
> >>> > It seems this idea has some key features:
> >>> > - It can be totally distributed, since you can partition by permutation
> >>> > ID + simhash prefix. The more reducers you use, the quicker everything
> >>> > is computed.
> >>> > - It is very efficient, since the documents themselves are not shuffled;
> >>> > only the simhashes are sent to the reduce phase.
> >>> >
> >>> > However, its use is limited to huge datasets with modest-sized documents
> >>> > (it is not a good fit for short strings, for instance).
> >>> >
> >>> > I searched and found this JIRA:
> >>> > https://issues.apache.org/jira/browse/MAHOUT-365 and some conversations
> >>> > (http://mail-archives.apache.org/mod_mbox/mahout-dev/201003.mbox/%[email protected]%3E).
> >>> > However, it seems nothing is in the works?
> >>> >
> >>> > I used it in the past for an experiment detecting duplicated web pages
> >>> > in Hadoop. I would need to do further proper testing with big datasets
> >>> > to make it publicly available. So, I would appreciate your feedback on
> >>> > this, and if you think it can be a good contribution, just tell me what
> >>> > steps to follow.
> >>> >
> >>> > Thanks!
> >>> >
> >>> > Pere.
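
For reference, here is a minimal sketch of the 64-bit simhash and the
"differ in at most k bits" check discussed in the thread, in the
Charikar style the Google paper builds on: each feature hash votes on
every bit position, and near-duplication is a Hamming-distance test
(XOR plus popcount). The class and method names are illustrative only;
this is not the code from Pere's implementation.

  /** Illustrative 64-bit simhash, Charikar-style; not Pere's actual patch. */
  public class SimHashSketch {

    /** Builds a 64-bit simhash from pre-hashed features (e.g. token hashes). */
    public static long simhash(long[] featureHashes) {
      int[] votes = new int[64];
      for (long h : featureHashes) {
        for (int bit = 0; bit < 64; bit++) {
          // Each feature votes +1 or -1 on every bit position.
          votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
        }
      }
      long signature = 0L;
      for (int bit = 0; bit < 64; bit++) {
        if (votes[bit] > 0) {
          signature |= 1L << bit;
        }
      }
      return signature;
    }

    /** Near-duplicates differ in at most k bits; the paper reports k = 3
        working well for 64-bit simhashes over a large crawl corpus. */
    public static boolean nearDuplicate(long a, long b, int k) {
      return Long.bitCount(a ^ b) <= k; // Hamming distance = popcount of XOR
    }
  }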
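
And a rough sketch of the Map/Reduce layout Pere outlines: the mapper
emits only a (permutation ID + simhash prefix) key with the permuted
simhash as value, so the documents themselves are never shuffled, and
each reducer compares its bucket in memory with Hamming distance. This
is an assumption-laden illustration: a plain bit rotation stands in for
the paper's block permutations, the constants (4 permutations, 16-bit
prefix) are not tuned values, and computeSimhash() is a toy feature
hasher layered on the sketch above.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  /** Illustrative near-duplicate job in the shape Pere describes. */
  public class NearDuplicateSketch {

    private static final int PERMUTATIONS = 4;  // illustrative, not tuned
    private static final int PREFIX_BITS = 16;  // bits of the key prefix
    private static final int K = 3;             // max Hamming distance

    /** Emits (permutationId:prefix -> permuted simhash); only simhashes are shuffled. */
    public static class SimhashMapper extends Mapper<Text, Text, Text, LongWritable> {
      @Override
      protected void map(Text docId, Text doc, Context context)
          throws IOException, InterruptedException {
        long simhash = computeSimhash(doc.toString());
        for (int p = 0; p < PERMUTATIONS; p++) {
          // Bit rotation as a stand-in for the paper's block permutations.
          long permuted = Long.rotateLeft(simhash, p * (64 / PERMUTATIONS));
          long prefix = permuted >>> (64 - PREFIX_BITS);
          context.write(new Text(p + ":" + prefix), new LongWritable(permuted));
        }
      }
    }

    /** Compares, in memory, all simhashes sharing one permutation-ID + prefix key. */
    public static class HammingReducer extends Reducer<Text, LongWritable, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        List<Long> bucket = new ArrayList<Long>();
        for (LongWritable v : values) {
          bucket.add(v.get()); // copy: Hadoop reuses the writable instance
        }
        for (int i = 0; i < bucket.size(); i++) {
          for (int j = i + 1; j < bucket.size(); j++) {
            if (Long.bitCount(bucket.get(i) ^ bucket.get(j)) <= K) {
              context.write(key, new Text(bucket.get(i) + " ~ " + bucket.get(j)));
            }
          }
        }
      }
    }

    /** Toy feature hashing so the sketch is self-contained; not a real tokenizer. */
    private static long computeSimhash(String doc) {
      String[] tokens = doc.toLowerCase().split("\\s+");
      long[] hashes = new long[tokens.length];
      for (int i = 0; i < tokens.length; i++) {
        hashes[i] = tokens[i].hashCode() * 0x9e3779b97f4a7c15L; // widen to 64 bits
      }
      return SimHashSketch.simhash(hashes);
    }
  }

Note that with this few permutations and a 16-bit prefix there is no
strict guarantee that every pair within k bits lands in a shared bucket;
the paper uses more permutation tables precisely to obtain that
guarantee, which is the part worth getting right in a real patch.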
