On Mar 25, 2010, at 11:09 AM, Joseph Stein wrote:
I have been researching ways to handle de-duplicating data while running a
map/reduce program (so as to not re-calculate/re-aggregate data that
we have seen before [possibly months before]).
So roughly, your problem is that you have a large amount of historic
data and you need to merge in the current month's data. The best
solution I've seen looks like this:
Keep your historic data sorted by MD5.
Run a MapReduce job to sort your new data into MD5 order. Note that
you want a total order, but because the MD5s are evenly spread across
the key space this is easy. Basically, you pick a number of reduces
(e.g. 256) and use the top N bits of the MD5 to pick your reduce.
Since this job only processes your new data, it is very fast.
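For concreteness, here is a minimal sketch of such a partitioner against
the old mapred API, assuming the sort key is a BytesWritable holding the
raw 16-byte MD5 (the key layout and class name are just illustrative):

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class Md5PrefixPartitioner implements Partitioner<BytesWritable, Writable> {
      public void configure(JobConf conf) { }

      // Because MD5 values are uniformly distributed, the top bits of the
      // digest give an even total order across reduces with no sampling.
      public int getPartition(BytesWritable key, Writable value, int numReduces) {
        int topByte = key.getBytes()[0] & 0xff;   // top 8 bits of the MD5
        return (topByte * numReduces) / 256;      // with 256 reduces this is just topByte
      }
    }
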
Next you do a map-side join where each input split consists of an MD5
range. The RecordReader reads from the historic and new datasets,
merging them as it goes. (You can use the map-side join library for
this.) Your map does the merge of the new and old records. This is a
map-only job, so it is also very fast.
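A rough sketch of the driver side, assuming both datasets are
SequenceFiles keyed by the MD5 and laid out with identical sorting and
partitioning; the paths and the MergeMapper class are hypothetical
placeholders for your own merge logic:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MergeDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(MergeDriver.class);
        job.setInputFormat(CompositeInputFormat.class);
        // "outer" keeps historic records that have no new counterpart; each
        // map call gets the MD5 key plus a TupleWritable holding the
        // matching record(s) from each source.
        job.set("mapred.join.expr",
            CompositeInputFormat.compose("outer", SequenceFileInputFormat.class,
                new Path("/data/historic"), new Path("/data/new")));
        job.setMapperClass(MergeMapper.class);   // hypothetical: merges old and new
        job.setOutputKeyClass(BytesWritable.class);  // the MD5
        job.setOutputValueClass(Text.class);         // the merged record
        job.setNumReduceTasks(0);                // map-only, so it is fast
        FileOutputFormat.setOutputPath(job, new Path("/data/merged"));
        JobClient.runJob(job);
      }
    }
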
Of course, if the new data is small enough, you can read all of the new
input in each of the maps, keep just the new records that fall in that
map's range (sorted in RAM), and do the merge from RAM. This lets you
avoid the step where you sort the new data. This kind of merge
optimization is where Pig and Hive hide lots of the details from
the developer.
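Here is a minimal sketch of that in-memory variant, again against the
old mapred API. The file path, key/value types, and "new record wins"
merge rule are just assumptions; a complete version would also restrict
the table to the split's MD5 range and emit any leftover new records
(e.g. from close()).

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.TreeMap;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InMemoryMergeMapper extends MapReduceBase
        implements Mapper<BytesWritable, Text, BytesWritable, Text> {
      // All of the new (md5, record) pairs, kept sorted in RAM.
      private final TreeMap<BytesWritable, Text> newRecords =
          new TreeMap<BytesWritable, Text>();

      public void configure(JobConf conf) {
        try {
          // Read the small new dataset once per map task.
          SequenceFile.Reader in = new SequenceFile.Reader(
              FileSystem.get(conf), new Path("/data/new/part-00000"), conf);
          BytesWritable k = new BytesWritable();
          Text v = new Text();
          while (in.next(k, v)) {
            newRecords.put(
                new BytesWritable(Arrays.copyOf(k.getBytes(), k.getLength())),
                new Text(v));
          }
          in.close();
        } catch (IOException e) {
          throw new RuntimeException("failed to load new data", e);
        }
      }

      public void map(BytesWritable md5, Text historic,
                      OutputCollector<BytesWritable, Text> out, Reporter reporter)
          throws IOException {
        // If this md5 was seen in the new data, the new record wins;
        // otherwise the historic record passes through unchanged.
        Text replacement = newRecords.remove(md5);
        out.collect(md5, replacement != null ? replacement : historic);
      }
    }
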
-- Owen