On Mar 25, 2010, at 11:09 AM, Joseph Stein wrote:

I have been researching ways to handle de-duping data while running a
map/reduce program (so as not to re-calculate/re-aggregate data that
we have seen before [possibly months before]).

So roughly, your problem is that you have a large amount of historic data and you need to merge in the current month's data without reprocessing what you have already seen. The best solution I've seen looks like this:

Keep your historic data sorted by the MD5 of each record.
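
For concreteness, a minimal sketch of deriving that key (not from the original thread); it assumes you hash the whole serialized record and hex-encode the digest, so that string order matches the underlying byte order:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative helper: produces the lowercase hex MD5 of a serialized record.
// Hex-encoding preserves unsigned byte order, so sorting the strings sorts the digests.
public final class Md5Key {
    public static String of(byte[] record) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(record);
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}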

Run a MapReduce job to sort your new data into MD5 order. Note that you want a total order, but because the MD5s are uniformly distributed across the key space this is easy. Basically, you pick a number of reduces (e.g. 256) and then use the top N bits of the MD5 to pick your reduce. Since this job only processes your new data, it is very fast.
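
A sketch of that partitioning step, assuming the map output key is the hex-encoded MD5 from above (the class name is just illustrative). Taking the top 32 bits and scaling them down to the number of reduces gives each reduce an equal, contiguous slice of the key space, so concatenating the sorted reduce outputs yields one total order:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class Md5RangePartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {
        // Nothing to configure.
    }

    public int getPartition(Text key, Text value, int numPartitions) {
        // The first 8 hex digits are the top 32 bits of the MD5. Scaling them
        // to [0, numPartitions) gives reduce i keys that all sort before those
        // of reduce i+1, and because MD5s are uniform the load is even too.
        long topBits = Long.parseLong(key.toString().substring(0, 8), 16);
        return (int) (topBits * numPartitions >>> 32);
    }
}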

Next, you do a map-side join where each input split consists of an MD5 range. The RecordReader reads from the historic and new datasets, merging them as it goes. (You can use the map-side join library for this.) Your map then merges the new and old records. This is a map-only job, so it is also very fast.
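
A sketch of what that wiring might look like with the old mapred join library (org.apache.hadoop.mapred.join). The paths, the "outer" join type, the Text record values, and the class names are assumptions on my part; both inputs must already be sorted by MD5 and split into the same number of partitions so CompositeInputFormat can line the ranges up:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class Md5MergeJob {

    // Map-only merge: slot 0 of the tuple is the historic record, slot 1 the new one.
    public static class MergeMapper extends MapReduceBase
            implements Mapper<Text, TupleWritable, Text, Text> {
        public void map(Text md5, TupleWritable joined,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // If the MD5 was already seen historically, keep the old record;
            // otherwise the new record is genuinely new and gets emitted.
            Writable kept = joined.has(0) ? joined.get(0) : joined.get(1);
            out.collect(md5, (Text) kept);
        }
    }

    public static void main(String[] args) throws IOException {
        Path historic = new Path(args[0]);
        Path fresh = new Path(args[1]);
        JobConf conf = new JobConf(Md5MergeJob.class);
        conf.setJobName("md5-merge");

        // The join expression makes each input split an MD5 range that is read
        // from both sorted datasets in parallel.
        conf.set("mapred.join.expr",
                 CompositeInputFormat.compose("outer",
                     SequenceFileInputFormat.class, historic, fresh));
        conf.setInputFormat(CompositeInputFormat.class);

        conf.setMapperClass(MergeMapper.class);
        conf.setNumReduceTasks(0);            // map-only, so it stays fast
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));
        JobClient.runJob(conf);
    }
}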

Of course, if the new data is small enough, you can read all of the new input in each map, keep (and sort in RAM) only the new records that fall in that map's range, and do the merge from RAM. This lets you skip the step where you sort the new data. This is the kind of merge optimization where Pig and Hive hide a lot of the details from the developer.
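
A rough sketch of that variant, under a couple of assumptions that are not in the original mail: the historic data sits in one sorted SequenceFile of Text pairs per MD5 bucket (part-00000 ... part-00255, as the 256-way sort above would produce), the new data is small enough to ship whole via the DistributedCache, and the class name is illustrative:

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class RamMergeMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    // New records that fall in this map's MD5 bucket, kept sorted by key.
    private final TreeMap<String, String> pending = new TreeMap<String, String>();
    private OutputCollector<Text, Text> collector;

    public void configure(JobConf conf) {
        try {
            // Derive this map's bucket from the historic file it reads (part-000NN),
            // then load only the new records whose top 8 bits (2 hex digits,
            // i.e. 256 buckets) hash into that bucket.
            String file = conf.get("map.input.file");
            int bucket = Integer.parseInt(file.substring(file.lastIndexOf("part-") + 5));

            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            SequenceFile.Reader reader =
                new SequenceFile.Reader(FileSystem.getLocal(conf), cached[0], conf);
            Text md5 = new Text();
            Text record = new Text();
            while (reader.next(md5, record)) {
                if (Integer.parseInt(md5.toString().substring(0, 2), 16) == bucket) {
                    pending.put(md5.toString(), record.toString());
                }
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(Text md5, Text record,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        collector = out;
        // Emit the pending new keys that sort before this historic key, then
        // drop any new record with the same MD5 (the de-dup itself).
        Map<String, String> ready = pending.headMap(md5.toString());
        for (Map.Entry<String, String> e : ready.entrySet()) {
            out.collect(new Text(e.getKey()), new Text(e.getValue()));
        }
        ready.clear();
        pending.remove(md5.toString());
        out.collect(md5, record);
    }

    public void close() throws IOException {
        // Flush new records that sort after the last historic key.
        if (collector != null) {
            for (Map.Entry<String, String> e : pending.entrySet()) {
                collector.collect(new Text(e.getKey()), new Text(e.getValue()));
            }
        }
    }
}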

-- Owen
