I have been researching ways to handle de-duping data while running a map/reduce program, so as to not re-calculate/re-aggregate data that we have seen before (possibly months before).
The data sets we have are littered with repeats of data from mobile devices which continue to come in over time, so we may see duplicates of data re-posted months after it was originally posted. I have two ways I can go about it so far (one of which I already do in production without Hadoop), and I am interested to hear whether others have faced/solved this in Hadoop/HDFS and what their experience has been.

1) Maintain my own hash filter, where I continually store and look up a hash (MD5, Bloom, whatever) of the data I am aggregating on to decide whether it already exists. We do this now without Hadoop; perhaps a variant could be ported into HDFS as a map task, reducing the results to files and rebuilding the hash table (maybe in Hive or something, dunno yet). A rough sketch of what I mean is below the signature.

2) Push the data into Cassandra (our NoSQL solution of choice) and let that hash/map system do it for us.

As I get deeper into Hadoop, looking at HBase is tempting, but that is just one more thing to learn. I would really like not to reinvent the wheel here, and would even contribute if something is already going on, since this is a use case in our work effort.

Thanx in advance =8^)

Apologies, I posted this to common-dev yesterday by accident (so this is not repost spam; it is appropriate for this list).

Cheers.

/* Joe Stein http://www.linkedin.com/in/charmalloc */
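P.S. Here is a minimal sketch of what I have in mind for option 1, against the plain Hadoop MapReduce API: the mapper keys each record by an MD5 of the fields we aggregate on, and the reducer keeps one copy per key. The class names are made up, it hashes the whole line rather than the real aggregation columns, it assumes commons-codec on the classpath for the MD5, and it only de-dupes within whatever input you feed it, so the previously de-duped output (or stored hashes) would have to be included alongside the new data to catch the months-old re-posts.

import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {

  // Map: key each record by the MD5 of the data we aggregate on,
  // so re-posted duplicates all land on the same reducer.
  public static class HashMapper extends Mapper<Object, Text, Text, Text> {
    private final Text hashKey = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Hashing the whole line here; in reality we would hash only the
      // columns that define "the same" data point.
      hashKey.set(DigestUtils.md5Hex(value.toString()));
      context.write(hashKey, value);
    }
  }

  // Reduce: emit one record per hash, drop the rest.
  public static class FirstSeenReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text v : values) {
        context.write(key, v);  // keep the first copy only
        break;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "dedup");
    job.setJarByClass(DedupJob.class);
    job.setMapperClass(HashMapper.class);
    job.setReducerClass(FirstSeenReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // new + previously de-duped data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}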