Re: Hadoop map reduce merge algorithm

Ravi Gummadi Thu, 12 Jan 2012 20:16:08 -0800

Yes. Spills of map output get merged to single file. The spills are triggered 
by the buffer size set using the configuration property io.sort.mb. Obviously 
bigger value for io.sort.mb is preferred for better performance --- but the 
limit is to be set based on the amount of RAM available.
Also, the bigger the value for the configuration property io.sort.factor the 
better in terms of performance. Even in this case, smaller value may have to be 
set for this config property based on the size of RAM available.


-Ravi

On 1/13/12   Friday 3:12 AM, "Bai Shen" <baishen.li...@gmail.com> wrote:

That's my understanding as well.  I can't seem to find any settings that govern 
the step where the output is merged into a single file.  io.sort.factor 
modifies the number of passes that is done, but it eventually ends up doing the 
same thing no matter how many spill files there are.  They're simply combined 
incrementally instead of all at once.

Is anybody more familiar with this step of the process?

Thanks.

On Thu, Jan 12, 2012 at 2:27 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
My understanding is that the mapper will cache the output in memory until its 
memory buffer fills up, at which point it will sort the data and spill it to 
disk.  Once a given number of spill files are created they will be merged 
together into a larger spill file.  Once the mapper finishes then the output is 
totally merged into a single file that can be served to the Reducer through the 
TaskTracker, or NodeManger under YARN.  The reducer does a similar thing as it 
merges the output form all of the mappers.  I don't understand all of the 
reasons behind this, but I think much of it is to optimize the time it takes to 
sort the data.  If you try to merge too many files then you waste a lot of time 
doing seeks and less time reading data.  But I was not involved with developing 
it so I don't know for sure.

--Bobby Evans


On 1/12/12 10:27 AM, "Bai Shen" <baishen.li...@gmail.com 
<http://baishen.li...@gmail.com> > wrote:

Can someone explain how the map reduce merge is done?  As far as I can tell, it 
appears to pull all of the spill files into one giant file to send to the 
reducer.  Is this correct?  Even if you set smaller spill files and a lower 
sort factor, the eventual merge is still the same.  It just takes more passes 
to get there.

Thanks.

Re: Hadoop map reduce merge algorithm

Reply via email to