Dan,

Shuffle and Sort is a combination of multiple 'algorithms':
- Map output goes to a circular, in-memory buffer
- When this starts filling up, it gets 'spilled' to disk
- Spilling involves writing each K/V pair to a partition-specific file
  (where the partition is chosen by the algorithm Jim describes below), in
  order sorted by K
- You may get multiple files per partition (you get a file per partition
  every time a spill happens)
- In which case, the spill files get merge-sorted into larger files
- Reducers pick up the final merged files from multiple mappers
- A reducer may pick up multiple of these files, from several mappers
- The reducer may perform a final merge before starting the reduce phase

There is a rough sketch of the partition and merge steps below, after the
quoted messages.

On Wed, Apr 28, 2010 at 2:47 PM, Jim Twensky <[email protected]> wrote:
> I'm not quite sure what you mean by the Shuffle algorithm, but briefly:
> for each (key, value) pair, a hash value is computed and then the
> record is sent to a node by taking the modulo of that hash. So if you
> have n reducers, the record goes to reducer #: hash(key) % n.
>
> The sorting algorithm is most probably a variation of the external
> sorting algorithm used for sorting objects that don't fit in memory.
> See:
>
> http://en.wikipedia.org/wiki/External_sorting
>
> for more details.
>
> Jim
>
> On Wed, Apr 28, 2010 at 1:16 PM, Dan Fundatureanu
> <[email protected]> wrote:
> > What is the algorithm used in the "Shuffle and Sort" step?
> >
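For illustration only, here is a rough, self-contained Java sketch of those
two pieces: the hash(key) % n partition rule Jim mentions (the
Integer.MAX_VALUE mask just keeps the result non-negative, which I believe
is what the default HashPartitioner does), and a k-way merge of sorted
spill runs using a priority queue. The spill files are modelled as
in-memory sorted lists purely to keep the example short; this is not the
actual Hadoop code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class ShuffleSortSketch {

    // Partition step: the hash(key) % n rule. The mask keeps the result
    // non-negative even when hashCode() is negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Merge step: k-way merge of already-sorted spill runs into one sorted
    // run, using a priority queue that always yields the smallest head key.
    static List<String> mergeSortedSpills(List<List<String>> spills) {
        // Each int[] entry is {spill index, position within that spill}.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (int[] a, int[] b) -> spills.get(a[0]).get(a[1])
                        .compareTo(spills.get(b[0]).get(b[1])));
        for (int i = 0; i < spills.size(); i++) {
            if (!spills.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] h = heads.poll();
            merged.add(spills.get(h[0]).get(h[1]));
            if (h[1] + 1 < spills.get(h[0]).size()) {
                heads.add(new int[] {h[0], h[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Which of 3 reducers would receive the key "hadoop".
        System.out.println(partition("hadoop", 3));

        // Two spill runs for one partition, each already sorted by key.
        List<List<String>> spills = Arrays.asList(
                Arrays.asList("apple", "mango", "zebra"),
                Arrays.asList("banana", "kiwi"));
        // Prints [apple, banana, kiwi, mango, zebra].
        System.out.println(mergeSortedSpills(spills));
    }
}

The same merge idea applies again on the reduce side, when the sorted map
outputs fetched from several mappers are merged before the reduce phase
starts.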
