On Mar 25, 2010, at 11:09 AM, Joseph Stein wrote:
I have been researching ways to handle de-duplicating data while running a
map/reduce program (so as to not re-calculate/re-aggregate data that
we have seen before [possibly months before]).
So roughly, your problem is that you have a large amount of historic
data and you need to merge in the current month's data. The best
solution I've seen looks like this:
Keep your historic data sorted by MD5.
Run a MapReduce job to sort your new data into MD5 order. Note that
you want a total order, but because the MD5s are evenly spread across
the key space this is easy. Basically, you pick a number of reduces
(e.g. 256) and use the top N bits of the MD5 to pick your reduce.
Since this job only processes your new data, it is very fast.
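For concreteness, here is a minimal sketch of such a partitioner against
the old mapred API, assuming the sort key is a BytesWritable holding the
raw 16-byte MD5 (the key layout and class name are just illustrative):

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class Md5PrefixPartitioner implements Partitioner<BytesWritable, Writable> {
      public void configure(JobConf conf) { }

      // Because MD5 values are uniformly distributed, the top bits of the
      // digest give an even total order across reduces with no sampling.
      public int getPartition(BytesWritable key, Writable value, int numReduces) {
        int topByte = key.getBytes()[0] & 0xff;   // top 8 bits of the MD5
        return (topByte * numReduces) / 256;      // with 256 reduces this is just topByte
      }
    }
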
Next you do a map-side join where each input split consists of an MD5
range. The RecordReader reads from the historic and new datasets,
merging them as it goes. (You can use the map-side join library for
this.) Your map does the merge of the new and old records. This is a
map-only job, so it is also very fast.
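A rough sketch of the driver side, assuming both datasets are
SequenceFiles keyed by the MD5 and laid out with identical sorting and
partitioning; the paths and the MergeMapper class are hypothetical
placeholders for your own merge logic:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MergeDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(MergeDriver.class);
        job.setInputFormat(CompositeInputFormat.class);
        // "outer" keeps historic records that have no new counterpart; each
        // map call gets the MD5 key plus a TupleWritable holding the
        // matching record(s) from each source.
        job.set("mapred.join.expr",
            CompositeInputFormat.compose("outer", SequenceFileInputFormat.class,
                new Path("/data/historic"), new Path("/data/new")));
        job.setMapperClass(MergeMapper.class);   // hypothetical: merges old and new
        job.setOutputKeyClass(BytesWritable.class);  // the MD5
        job.setOutputValueClass(Text.class);         // the merged record
        job.setNumReduceTasks(0);                // map-only, so it is fast
        FileOutputFormat.setOutputPath(job, new Path("/data/merged"));
        JobClient.runJob(job);
      }
    }
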
Of course, if the new data is small enough, you can read all of the new
input in each of the maps, keep just the new records that fall in that
map's range (sorted in RAM), and do the merge from RAM. This lets you
avoid the step where you sort the new data. This kind of merge
optimization is where Pig and Hive hide lots of the details from
the developer.
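Here is a minimal sketch of that in-memory variant, again against the
old mapred API. The file path, key/value types, and "new record wins"
merge rule are just assumptions; a complete version would also restrict
the table to the split's MD5 range and emit any leftover new records
(e.g. from close()).

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.TreeMap;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InMemoryMergeMapper extends MapReduceBase
        implements Mapper<BytesWritable, Text, BytesWritable, Text> {
      // All of the new (md5, record) pairs, kept sorted in RAM.
      private final TreeMap<BytesWritable, Text> newRecords =
          new TreeMap<BytesWritable, Text>();

      public void configure(JobConf conf) {
        try {
          // Read the small new dataset once per map task.
          SequenceFile.Reader in = new SequenceFile.Reader(
              FileSystem.get(conf), new Path("/data/new/part-00000"), conf);
          BytesWritable k = new BytesWritable();
          Text v = new Text();
          while (in.next(k, v)) {
            newRecords.put(
                new BytesWritable(Arrays.copyOf(k.getBytes(), k.getLength())),
                new Text(v));
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("failed to load new data", e);
        }
      }

      public void map(BytesWritable md5, Text historic,
                      OutputCollector<BytesWritable, Text> out, Reporter reporter)
          throws IOException {
        // If this md5 was seen in the new data, the new record wins;
        // otherwise the historic record passes through unchanged.
        Text replacement = newRecords.remove(md5);
        out.collect(md5, replacement != null ? replacement : historic);
      }
    }
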
-- Owen