near-duplicates with simhash

Pere Ferrera Wed, 08 Jun 2011 10:00:41 -0700

Hi guys,

Looking back at some code I wrote in the past, I was wondering whether it
would be a good fit for the Mahout project.


I implemented in Map/Reduce the idea from Google's paper "Detecting
Near-Duplicates for Web Crawling"
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.78.7794&rep=rep1&type=pdf>.
Basically, I compute a simhash for each document in the mapper and
generate some permutations of it. Reducers then compare, in memory, the
simhashes belonging to the same permutation, using Hamming distance.
It seems this idea has some key features:
- It can be totally distributed since you can partition by permutation ID +
simhash prefix. The more reducers you use, the quicker everything will be
computed.
- It is very efficient since the documents themselves are not shuffled, only
simhashes are sent to the reduce phase.
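
To make the flow above concrete, here is a minimal single-process Python
sketch of it. It is not the original Map/Reduce code. Everything specific in
it is an assumption made for illustration: 64-bit fingerprints, an MD5-based
token hash, bit rotations as the "permutations", a distance threshold of
k = 3 with four 16-bit blocks, and all function names.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

BITS = 64                      # assumed fingerprint width
MASK = (1 << BITS) - 1


def token_hash(token):
    """Stable 64-bit hash of a token (MD5 truncated; any stable hash would do)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(tokens):
    """Each token votes +1/-1 on every bit; the sign of the sum gives the bit."""
    counts = [0] * BITS
    for tok in tokens:
        h = token_hash(tok)
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(BITS):
        if counts[i] > 0:
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def map_fingerprint(doc_id, fp, blocks=4):
    """'Mapper': emit one record per permutation (here, a left rotation).

    The key is (permutation ID, leading bits of the rotated fingerprint),
    so only fingerprints and IDs are shuffled, never the documents.
    """
    step = BITS // blocks
    for pid in range(blocks):
        shift = pid * step
        rotated = ((fp << shift) | (fp >> (BITS - shift))) & MASK
        prefix = rotated >> (BITS - step)
        yield (pid, prefix), (rotated, doc_id)


def reduce_bucket(values, k=3):
    """'Reducer': compare all fingerprints that share a key, in memory.

    Rotation does not change Hamming distance, so rotated values are
    compared directly.
    """
    for (fa, ida), (fb, idb) in combinations(values, 2):
        if ida != idb and hamming(fa, fb) <= k:
            yield tuple(sorted((ida, idb)))


def find_near_duplicates(fingerprints, k=3):
    """Simulate the shuffle with a dict keyed by (permutation ID, prefix)."""
    buckets = defaultdict(list)
    for doc_id, fp in fingerprints.items():
        for key, value in map_fingerprint(doc_id, fp, blocks=k + 1):
            buckets[key].append(value)
    pairs = set()
    for values in buckets.values():
        pairs.update(reduce_bucket(values, k))
    return pairs
```

With k = 3 and four 16-bit blocks, two fingerprints that differ in at most
3 bits must agree exactly on at least one block, so they share a key under at
least one rotation and end up in the same reducer group.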

However, its use is limited to huge datasets with modest-sized documents (it
is not a good fit for short strings, for instance).

I searched and found this JIRA:
https://issues.apache.org/jira/browse/MAHOUT-365 and some conversations (
http://mail-archives.apache.org/mod_mbox/mahout-dev/201003.mbox/%[email protected]%3E).
However, it seems nothing is in progress?

I used it in a past experiment to detect duplicated web pages in Hadoop. I
would need to do further, proper testing with big data sets before making it
publicly available. So I would appreciate your feedback on this, and if you
think it can be a good contribution, just tell me what steps to follow.

Thanks!

Pere.
