Re: near-duplicates with simhash

Pere Ferrera Wed, 08 Jun 2011 11:33:01 -0700

(by the way, the idea is about duplicates and near-duplicates, not only
exact duplicates).


On Wed, Jun 8, 2011 at 8:16 PM, Pere Ferrera <[email protected]>wrote:

> Google's method funny thing is it allows you to find simhashes that differ
> at most in "k" bits (not only join by equal simhashes). They found k = 3 a
> good parameter for 64 bit simhashes and a large corpus of crawled pages. I'm
> guessing this can be potentially more useful (detecting more duplicates) but
> I'll need to dig a bit more into it.
>
> On Wed, Jun 8, 2011 at 7:51 PM, Elmer Garduno <[email protected]> wrote:
>
>> I've found an article,
>>
>> http://www.xcombinator.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/that
>> describes an implementation of simhash in MapReduce, the
>> implementation
>> is licensed under GPL v3
>>
>> Also, for short messages Twitter uses MinHashing and 4 byte signatures
>> before inserting to Lucene
>>
>> http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html
>>
>> On Wed, Jun 8, 2011 at 11:59 AM, Pere Ferrera <[email protected]
>> >wrote:
>>
>> > Hi guys,
>> >
>> > Looking back to some code I did in the past I was wondering if this
>> piece
>> > would be a good fit in the Mahout project.
>> >
>> > I implemented in Map/Reduce the idea of this Google's paper "detecting
>> > near-duplicates for web
>> > crawling<
>> >
>> http://www.google.es/url?sa=t&source=web&cd=1&ved=0CBwQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.78.7794%26rep%3Drep1%26type%3Dpdf&rct=j&q=detecting%20near-duplicates%20for%20web%20crawling&ei=DqfvTZykFpGwhAeAusSRCQ&usg=AFQjCNEeQnftMUXrnUwX3nJcN5hlt6tyjQ
>> > >"
>> > . Basically I'm computing a simhash for each document in the mapper and
>> > generating some permutations of it. Reducers compare in-memory simhashes
>> > belonging to the same permutation, with Hamming distance.
>> > It seems this idea has some key features:
>> > - It can be totally distributed since you can partition by permutation
>> ID +
>> > simhash prefix. The more reducers you use, the quicker everything will
>> be
>> > computed.
>> > - It is very efficient since the documents themselves are not shuffled,
>> > only
>> > simhashes are sent to the reduce phase.
>> >
>> > However its use is limited to huge datasets with modest-sized documents
>> > (not
>> > a good fit for short strings, for instance).
>> >
>> > I searched and found this JIRA:
>> > https://issues.apache.org/jira/browse/MAHOUT-365 and some conversations
>> (
>> >
>> >
>> http://mail-archives.apache.org/mod_mbox/mahout-dev/201003.mbox/%[email protected]%3E
>> > ).
>> > However it seems nothing's on the way?
>> >
>> > I used it for an experiment in the past for detecting duplicated
>> web-pages
>> > in Hadoop. I would need to work on further proper testing with big data
>> > sets
>> > to make it publicly available. So, I will appreciate your feedback on
>> this,
>> > and if you think it can be a good contribution, just tell me what are
>> the
>> > steps to follow.
>> >
>> > Thanks!
>> >
>> > Pere.
>> >
>>
>
>

Re: near-duplicates with simhash

Reply via email to