I think it is an interesting idea to have an implementation of this. (By the way, Ivan said you would have said hi if you had been at Buzzwords with him.)
On Wed, Jun 8, 2011 at 11:31 AM, Pere Ferrera <[email protected]> wrote:

> (By the way, the idea is about duplicates and near-duplicates, not only
> exact duplicates.)
>
> On Wed, Jun 8, 2011 at 8:16 PM, Pere Ferrera <[email protected]> wrote:
>
>> The funny thing about Google's method is that it allows you to find
>> simhashes that differ in at most "k" bits (not only join on equal
>> simhashes). They found k = 3 to be a good parameter for 64-bit simhashes
>> and a large corpus of crawled pages. I'm guessing this can be potentially
>> more useful (detecting more duplicates), but I'll need to dig a bit more
>> into it.
>>
>> On Wed, Jun 8, 2011 at 7:51 PM, Elmer Garduno <[email protected]> wrote:
>>
>>> I've found an article,
>>> http://www.xcombinator.com/2011/05/09/cascading-simhash-a-library-to-cluster-by-minhashes-in-hadoop/
>>> that describes an implementation of simhash in MapReduce; the
>>> implementation is licensed under GPL v3.
>>>
>>> Also, for short messages Twitter uses MinHashing and 4-byte signatures
>>> before inserting into Lucene:
>>> http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html
>>>
>>> On Wed, Jun 8, 2011 at 11:59 AM, Pere Ferrera <[email protected]> wrote:
>>>
>>> > Hi guys,
>>> >
>>> > Looking back at some code I wrote in the past, I was wondering whether
>>> > this piece would be a good fit for the Mahout project.
>>> >
>>> > I implemented in MapReduce the idea from Google's paper "Detecting
>>> > Near-Duplicates for Web Crawling"
>>> > <http://www.google.es/url?sa=t&source=web&cd=1&ved=0CBwQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.78.7794%26rep%3Drep1%26type%3Dpdf&rct=j&q=detecting%20near-duplicates%20for%20web%20crawling&ei=DqfvTZykFpGwhAeAusSRCQ&usg=AFQjCNEeQnftMUXrnUwX3nJcN5hlt6tyjQ>.
>>> > Basically, I compute a simhash for each document in the mapper and
>>> > generate some permutations of it. Reducers compare, in memory, the
>>> > simhashes belonging to the same permutation using Hamming distance.
>>> > This idea has some key features:
>>> > - It can be totally distributed, since you can partition by permutation
>>> > ID + simhash prefix. The more reducers you use, the quicker everything
>>> > is computed.
>>> > - It is very efficient, since the documents themselves are not
>>> > shuffled; only the simhashes are sent to the reduce phase.
>>> >
>>> > However, its use is limited to huge datasets with modest-sized
>>> > documents (not a good fit for short strings, for instance).
>>> >
>>> > I searched and found this JIRA:
>>> > https://issues.apache.org/jira/browse/MAHOUT-365
>>> > and some conversations
>>> > (http://mail-archives.apache.org/mod_mbox/mahout-dev/201003.mbox/%[email protected]%3E).
>>> > However, it seems nothing is under way?
>>> >
>>> > I used it in a past experiment to detect duplicated web pages in
>>> > Hadoop. I would need to do further proper testing with big datasets to
>>> > make it publicly available. So I would appreciate your feedback on
>>> > this, and if you think it would be a good contribution, just tell me
>>> > what steps to follow.
>>> >
>>> > Thanks!
>>> >
>>> > Pere.
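
For anyone skimming the thread: here is a minimal Java sketch of the 64-bit simhash fingerprint and the k-bit Hamming test being discussed. It is an illustration, not Pere's code; tokenization and term weighting are left out, and the token-hash-to-weight map is a placeholder for whatever feature extraction you use.

import java.util.Map;

public final class SimHash {

  /** Combine per-token 64-bit hashes into one fingerprint by bitwise voting. */
  public static long simhash(Map<Long, Integer> tokenHashToWeight) {
    int[] votes = new int[64];
    for (Map.Entry<Long, Integer> e : tokenHashToWeight.entrySet()) {
      long h = e.getKey();
      int w = e.getValue();
      for (int bit = 0; bit < 64; bit++) {
        // Each token votes +w on bit positions where its hash is 1, -w where it is 0.
        votes[bit] += ((h >>> bit) & 1L) == 1L ? w : -w;
      }
    }
    long fingerprint = 0L;
    for (int bit = 0; bit < 64; bit++) {
      if (votes[bit] > 0) {
        fingerprint |= 1L << bit;  // Majority vote decides each fingerprint bit.
      }
    }
    return fingerprint;
  }

  /** Number of differing bits between two fingerprints. */
  public static int hammingDistance(long a, long b) {
    return Long.bitCount(a ^ b);
  }

  /** Near-duplicate test: at most k differing bits (the paper reports k = 3 works well). */
  public static boolean isNearDuplicate(long a, long b, int k) {
    return hammingDistance(a, b) <= k;
  }
}

Similar documents share most token hashes, so the bit votes mostly agree and their fingerprints land within a few bits of each other; isNearDuplicate(f1, f2, 3) is then the pairwise test a reducer would run within each group.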
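
And a rough sketch of the partitioning trick, assuming the "permutations" are simple block rotations (the paper's permutation tables are more general, and Pere's actual key layout may differ; ROTATIONS and PREFIX_BITS are illustrative values, not settings from the paper). With 4 blocks of 16 bits and k = 3, the pigeonhole principle guarantees that two fingerprints within Hamming distance 3 agree on at least one whole block, so they collide on the prefix under at least one rotation.

public final class PermutationKeys {

  static final int ROTATIONS = 4;     // one rotation per 16-bit block (illustrative)
  static final int PREFIX_BITS = 16;  // leading bits used as the grouping prefix

  /** Hypothetical callback standing in for Hadoop's context.write(...). */
  public interface KeyConsumer {
    void accept(int permutationId, long prefix, long permutedFingerprint);
  }

  /** Emit one (permutation id, prefix, permuted fingerprint) triple per rotation. */
  public static void emitKeys(long fingerprint, KeyConsumer out) {
    for (int p = 0; p < ROTATIONS; p++) {
      long permuted = Long.rotateLeft(fingerprint, p * (64 / ROTATIONS));
      long prefix = permuted >>> (64 - PREFIX_BITS);
      // Only these 8-byte fingerprints are shuffled; the documents stay put.
      out.accept(p, prefix, permuted);
    }
  }
}

In Hadoop terms, (permutationId, prefix) plays the role of the shuffle key, which is why the job partitions cleanly across reducers: each reducer only has to compare, in memory, the fingerprints of one permutation/prefix group using the Hamming test above.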
