I have been researching ways to handle de-duping data while running a map/reduce program, so as to not re-calculate/re-aggregate data that we have seen before (possibly months before).
The data sets we have are littered with repeats of data from mobile devices which continue to come in over time, so we may see duplicates of data re-posted months after it was originally posted. I have two ways I can go about it so far (one of which I already do in production without Hadoop), and I am interested to hear whether others have faced/solved this in Hadoop/HDFS and what their experience has been.

1) Maintain my own hash filter, where I continually store and look up a hash (MD5, Bloom, whatever) of the data I am aggregating on to decide whether it already exists. We do this now without Hadoop; perhaps a variant could be ported into HDFS as a map task, reducing the results to files and rebuilding the hash table (maybe in Hive or something, dunno yet). A rough sketch of what I mean is below the signature.

2) Push the data into Cassandra (our NoSQL solution of choice) and let that hash/map system do it for us.

As I get deeper into Hadoop, looking at HBase is tempting, but that is just one more thing to learn. I would really like not to reinvent the wheel here, and would even contribute if something is already going on, since this is a use case in our work effort.

Thanx in advance =8^)

Apologies, I posted this to common-dev yesterday by accident (so this is not repost spam; it is appropriate for this list).

Cheers.

/* Joe Stein http://www.linkedin.com/in/charmalloc */
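P.S. Here is a minimal sketch of what I have in mind for option 1, against the plain Hadoop MapReduce API: the mapper keys each record by an MD5 of the fields we aggregate on, and the reducer keeps one copy per key. The class names are made up, it hashes the whole line rather than the real aggregation columns, it assumes commons-codec on the classpath for the MD5, and it only de-dupes within whatever input you feed it, so the previously de-duped output (or stored hashes) would have to be included alongside the new data to catch the months-old re-posts.

import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {

  // Map: key each record by the MD5 of the data we aggregate on,
  // so re-posted duplicates all land on the same reducer.
  public static class HashMapper extends Mapper<Object, Text, Text, Text> {
    private final Text hashKey = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Hashing the whole line here; in reality we would hash only the
      // columns that define "the same" data point.
      hashKey.set(DigestUtils.md5Hex(value.toString()));
      context.write(hashKey, value);
    }
  }

  // Reduce: emit one record per hash, drop the rest.
  public static class FirstSeenReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text v : values) {
        context.write(key, v);  // keep the first copy only
        break;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "dedup");
    job.setJarByClass(DedupJob.class);
    job.setMapperClass(HashMapper.class);
    job.setReducerClass(FirstSeenReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // new + previously de-duped data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}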