Intermediate data from the map phase is written to local disk by the Mapper. After that, the data is sent to the Reducer(s), which perform 3 steps (a minimal Reducer sketch follows below):

- shuffle: the map outputs are fetched over the network and become the input to the Reducer(s);
- sort: the map outputs are merged and grouped by key; this happens while the data is being fetched, so it overlaps with the shuffle;
- reduce: the reduce() method of the Reducer class is called once per key, receiving that key and its associated values.
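Just to illustrate what "once per key" looks like, here is a minimal Reducer sketch using the org.apache.hadoop.mapreduce (new) API. The class name and the word-count-style summing are only for illustration, not something from the tutorial:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // reduce() is called once per key after the shuffle/sort phase;
        // every value emitted by the mappers for this key arrives together in 'values'.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}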
All the intermediate data associated with a given intermediate key is sent to the same reducer (during the shuffle and sort phase), and the developer can set the number of Reducers to run for a given MR job via Job.setNumReduceTasks(int) (there is a small driver snippet at the bottom of this mail showing where it is called). Increasing this number can help balance the load across reducers.

You can find more info at http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html

Regards,
Wellington.

2012/1/12 Bai Shen <baishen.li...@gmail.com>:
> Can someone explain how the map reduce merge is done? As far as I can tell,
> it appears to pull all of the spill files into one giant file to send to the
> reducer. Is this correct? Even if you set smaller spill files and a lower
> sort factor, the eventual merge is still the same. It just takes more
> passes to get there.
>
> Thanks.
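As mentioned above, here is a rough sketch of a job driver showing where setNumReduceTasks(int) fits. Class names, paths and the reducer count of 4 are made up for illustration; TokenizerMapper is a hypothetical Mapper and SumReducer is the Reducer sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");   // 0.20.x style constructor

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // hypothetical Mapper class
        job.setReducerClass(SumReducer.class);       // Reducer sketched above

        // Partition the intermediate keys across 4 reduce tasks;
        // each key (and all of its values) goes to exactly one of them.
        job.setNumReduceTasks(4);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}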