On Thu, Jun 9, 2011 at 1:15 AM, Ted Dunning <[email protected]> wrote:
> I think that it is an interesting idea to have an implementation of this.
>

Good. I guess I'll file a JIRA and submit a patch with it then.

> (btw... Ivan said you would say hi if you had been at buzzwords with him)
>

Yes, it would have been interesting to go... but he won at "rock paper
scissors" :) Fortunately, Apache Eurocon is in Barcelona this year, so I
have no excuses for that one...

> On Wed, Jun 8, 2011 at 11:31 AM, Pere Ferrera <[email protected]> wrote:
> > (by the way, the idea is about duplicates and near-duplicates, not only
> > exact duplicates).
> >
> > On Wed, Jun 8, 2011 at 8:16 PM, Pere Ferrera <[email protected]> wrote:
> >
> >> The funny thing about Google's method is that it allows you to find
> >> simhashes that differ in at most "k" bits (not only join by equal
> >> simhashes). They found k = 3 to be a good parameter for 64-bit simhashes
> >> and a large corpus of crawled pages. I'm guessing this can be potentially
> >> more useful (detecting more duplicates), but I'll need to dig a bit more
> >> into it.
> >>
> >> On Wed, Jun 8, 2011 at 7:51 PM, Elmer Garduno <[email protected]> wrote:
> >>
> >>> I've found an article,
> >>> http://www.xcombinator.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/
> >>> that describes an implementation of simhash in MapReduce; the
> >>> implementation is licensed under GPL v3.
> >>>
> >>> Also, for short messages, Twitter uses MinHashing and 4-byte signatures
> >>> before inserting into Lucene:
> >>> http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html
> >>>
> >>> On Wed, Jun 8, 2011 at 11:59 AM, Pere Ferrera <[email protected]> wrote:
> >>>
> >>> > Hi guys,
> >>> >
> >>> > Looking back at some code I wrote in the past, I was wondering whether
> >>> > this piece would be a good fit for the Mahout project.
> >>> >
> >>> > I implemented in Map/Reduce the idea of the Google paper "Detecting
> >>> > near-duplicates for web crawling"
> >>> > <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.7794&rep=rep1&type=pdf>.
> >>> > Basically, I compute a simhash for each document in the mapper and
> >>> > generate some permutations of it. Reducers then compare, in memory, the
> >>> > simhashes belonging to the same permutation, using Hamming distance.
> >>> > It seems this idea has some key features:
> >>> > - It can be totally distributed, since you can partition by permutation
> >>> > ID + simhash prefix. The more reducers you use, the quicker everything
> >>> > is computed.
> >>> > - It is very efficient, since the documents themselves are not shuffled;
> >>> > only the simhashes are sent to the reduce phase.
> >>> >
> >>> > However, its use is limited to huge datasets with modest-sized documents
> >>> > (it is not a good fit for short strings, for instance).
> >>> >
> >>> > I searched and found this JIRA:
> >>> > https://issues.apache.org/jira/browse/MAHOUT-365 and some conversations
> >>> > (http://mail-archives.apache.org/mod_mbox/mahout-dev/201003.mbox/%[email protected]%3E).
> >>> > However, it seems nothing is in the works?
> >>> >
> >>> > I used it in the past for an experiment detecting duplicated web pages
> >>> > in Hadoop. I would need to do further proper testing with big datasets
> >>> > to make it publicly available. So, I would appreciate your feedback on
> >>> > this, and if you think it can be a good contribution, just tell me what
> >>> > steps to follow.
> >>> >
> >>> > Thanks!
> >>> >
> >>> > Pere.
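
For reference, here is a minimal sketch of the 64-bit simhash and the
"differ in at most k bits" check discussed in the thread, in the
Charikar style the Google paper builds on: each feature hash votes on
every bit position, and near-duplication is a Hamming-distance test
(XOR plus popcount). The class and method names are illustrative only;
this is not the code from Pere's implementation.

  /** Illustrative 64-bit simhash, Charikar-style; not Pere's actual patch. */
  public class SimHashSketch {

    /** Builds a 64-bit simhash from pre-hashed features (e.g. token hashes). */
    public static long simhash(long[] featureHashes) {
      int[] votes = new int[64];
      for (long h : featureHashes) {
        for (int bit = 0; bit < 64; bit++) {
          // Each feature votes +1 or -1 on every bit position.
          votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
        }
      }
      long signature = 0L;
      for (int bit = 0; bit < 64; bit++) {
        if (votes[bit] > 0) {
          signature |= 1L << bit;
        }
      }
      return signature;
    }

    /** Near-duplicates differ in at most k bits; the paper reports k = 3
        working well for 64-bit simhashes over a large crawl corpus. */
    public static boolean nearDuplicate(long a, long b, int k) {
      return Long.bitCount(a ^ b) <= k; // Hamming distance = popcount of XOR
    }
  }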
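
And a rough sketch of the Map/Reduce layout Pere outlines: the mapper
emits only a (permutation ID + simhash prefix) key with the permuted
simhash as value, so the documents themselves are never shuffled, and
each reducer compares its bucket in memory with Hamming distance. This
is an assumption-laden illustration: a plain bit rotation stands in for
the paper's block permutations, the constants (4 permutations, 16-bit
prefix) are not tuned values, and computeSimhash() is a toy feature
hasher layered on the sketch above.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  /** Illustrative near-duplicate job in the shape Pere describes. */
  public class NearDuplicateSketch {

    private static final int PERMUTATIONS = 4;  // illustrative, not tuned
    private static final int PREFIX_BITS = 16;  // bits of the key prefix
    private static final int K = 3;             // max Hamming distance

    /** Emits (permutationId:prefix -> permuted simhash); only simhashes are shuffled. */
    public static class SimhashMapper extends Mapper<Text, Text, Text, LongWritable> {
      @Override
      protected void map(Text docId, Text doc, Context context)
          throws IOException, InterruptedException {
        long simhash = computeSimhash(doc.toString());
        for (int p = 0; p < PERMUTATIONS; p++) {
          // Bit rotation as a stand-in for the paper's block permutations.
          long permuted = Long.rotateLeft(simhash, p * (64 / PERMUTATIONS));
          long prefix = permuted >>> (64 - PREFIX_BITS);
          context.write(new Text(p + ":" + prefix), new LongWritable(permuted));
        }
      }
    }

    /** Compares, in memory, all simhashes sharing one permutation-ID + prefix key. */
    public static class HammingReducer extends Reducer<Text, LongWritable, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        List<Long> bucket = new ArrayList<Long>();
        for (LongWritable v : values) {
          bucket.add(v.get()); // copy: Hadoop reuses the writable instance
        }
        for (int i = 0; i < bucket.size(); i++) {
          for (int j = i + 1; j < bucket.size(); j++) {
            if (Long.bitCount(bucket.get(i) ^ bucket.get(j)) <= K) {
              context.write(key, new Text(bucket.get(i) + " ~ " + bucket.get(j)));
            }
          }
        }
      }
    }

    /** Toy feature hashing so the sketch is self-contained; not a real tokenizer. */
    private static long computeSimhash(String doc) {
      String[] tokens = doc.toLowerCase().split("\\s+");
      long[] hashes = new long[tokens.length];
      for (int i = 0; i < tokens.length; i++) {
        hashes[i] = tokens[i].hashCode() * 0x9e3779b97f4a7c15L; // widen to 64 bits
      }
      return SimHashSketch.simhash(hashes);
    }
  }

Note that with this few permutations and a 16-bit prefix there is no
strict guarantee that every pair within k bits lands in a shared bucket;
the paper uses more permutation tables precisely to obtain that
guarantee, which is the part worth getting right in a real patch.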
