After observing the speed differential between the fast and slow nodes on my 200 node cluster, as seen in HADOOP-253, I wanted to try running more, smaller reduces. So instead of the default 400 reduces, which could all start at the beginning of execution, I used 700. This means that every node runs at least 2 reduces and the fastest 150 nodes run an additional 2 reduces each. Clearly, in a case where all of the nodes are well balanced, that will lose, because the second round of data shuffling doesn't overlap with the maps. However, in the presence of failures or uneven hardware, it will be a win.
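For anyone who wants to try the same thing, here's a rough sketch using the org.apache.hadoop.mapred.JobConf API (the class and method names around it are just illustrative, not the actual sort driver):

    import org.apache.hadoop.mapred.JobConf;

    public class SortReduceCount {
      // Sketch only: raise the reduce count from 400 to 700.
      // With 200 nodes and (assumed) 2 reduce slots per node, the first
      // wave of 400 reduces starts alongside the maps; the remaining 300
      // run as a second wave of 2 each on the fastest 150 nodes.
      public static JobConf configure() {
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(700);
        return conf;
      }
    }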

That change brought the run time for sorting my 2010 gigabyte dataset down from 8.5 hours to 6.6 hours. For those of you keeping score, the sort benchmark was taking 47 hours at the start of the month and now takes 6.6 hours on the same hardware.

Note that it would have also made sense to double the block size on the inputs, so that the amount of data on each of the M*R data paths stayed constant, but I wanted to try the changes independently. As for my other config choices, the only non-default ones are:

dfs.block.size=134217728
io.sort.factor=100
io.file.buffer.size=65536
mapred.reduce.parallel.copies=10
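
In case it helps, the same settings expressed as Configuration.set calls look roughly like this (normally they'd live in hadoop-site.xml; the class name here is just for illustration):

    import org.apache.hadoop.mapred.JobConf;

    public class NonDefaultSettings {
      // Sketch of the non-default settings listed above.
      public static void apply(JobConf conf) {
        conf.set("dfs.block.size", "134217728");         // 128 MB blocks
        conf.set("io.sort.factor", "100");               // streams merged at once per sort pass
        conf.set("io.file.buffer.size", "65536");        // 64 KB I/O buffers
        conf.set("mapred.reduce.parallel.copies", "10"); // parallel map-output fetches per reduce
      }
    }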

I'm also looking forward to trying out Ben Reed's patches to reduce the number of trips to disk in the reduces.

-- Owen
