Re: near-duplicates with simhash

Pere Ferrera Wed, 08 Jun 2011 11:17:24 -0700

Google's method funny thing is it allows you to find simhashes that differ
at most in "k" bits (not only join by equal simhashes). They found k = 3 a
good parameter for 64 bit simhashes and a large corpus of crawled pages. I'm
guessing this can be potentially more useful (detecting more duplicates) but
I'll need to dig a bit more into it.


On Wed, Jun 8, 2011 at 7:51 PM, Elmer Garduno <[email protected]> wrote:

> I've found an article,
>
> http://www.xcombinator.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/that
> describes an implementation of simhash in MapReduce, the
> implementation
> is licensed under GPL v3
>
> Also, for short messages Twitter uses MinHashing and 4 byte signatures
> before inserting to Lucene
>
> http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html
>
> On Wed, Jun 8, 2011 at 11:59 AM, Pere Ferrera <[email protected]
> >wrote:
>
> > Hi guys,
> >
> > Looking back to some code I did in the past I was wondering if this piece
> > would be a good fit in the Mahout project.
> >
> > I implemented in Map/Reduce the idea of this Google's paper "detecting
> > near-duplicates for web
> > crawling<
> >
> http://www.google.es/url?sa=t&source=web&cd=1&ved=0CBwQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.78.7794%26rep%3Drep1%26type%3Dpdf&rct=j&q=detecting%20near-duplicates%20for%20web%20crawling&ei=DqfvTZykFpGwhAeAusSRCQ&usg=AFQjCNEeQnftMUXrnUwX3nJcN5hlt6tyjQ
> > >"
> > . Basically I'm computing a simhash for each document in the mapper and
> > generating some permutations of it. Reducers compare in-memory simhashes
> > belonging to the same permutation, with Hamming distance.
> > It seems this idea has some key features:
> > - It can be totally distributed since you can partition by permutation ID
> +
> > simhash prefix. The more reducers you use, the quicker everything will be
> > computed.
> > - It is very efficient since the documents themselves are not shuffled,
> > only
> > simhashes are sent to the reduce phase.
> >
> > However its use is limited to huge datasets with modest-sized documents
> > (not
> > a good fit for short strings, for instance).
> >
> > I searched and found this JIRA:
> > https://issues.apache.org/jira/browse/MAHOUT-365 and some conversations
> (
> >
> >
> http://mail-archives.apache.org/mod_mbox/mahout-dev/201003.mbox/%[email protected]%3E
> > ).
> > However it seems nothing's on the way?
> >
> > I used it for an experiment in the past for detecting duplicated
> web-pages
> > in Hadoop. I would need to work on further proper testing with big data
> > sets
> > to make it publicly available. So, I will appreciate your feedback on
> this,
> > and if you think it can be a good contribution, just tell me what are the
> > steps to follow.
> >
> > Thanks!
> >
> > Pere.
> >
>

Re: near-duplicates with simhash

Reply via email to