Dan,

Shuffle and Sort is a combination of multiple 'algorithms'.

- Map output goes to a circular, in-memory buffer
- When this starts filling up, it gets 'spilled' to disk
- Spilling involves writing each K/V pair to a partition specific file
(where partition is the algorithm Jim describes below) in order sorted by K
- You may get multiple files per partition (you get a file per-partition
every time a spill happens)
- In which case, the spill files get merge sorted into larger files
- Reducers pick up the final merged files from multiple mappers
- Reducers may pick up multiple of these files from several mappers
- The reducer may perform a final merge before starting the reduce phase



On Wed, Apr 28, 2010 at 2:47 PM, Jim Twensky <[email protected]> wrote:

> I'm not quite sure what you mean by the Shuffle algorithm but briefly
> for each (key,value) pair, a hash value is computed and then the
> record is sent to a node by taking the modulo of that hash. So if you
> have n reducers, the record goes to reducer # : hash(key) % n.
>
> The sorting algorithm is most probably a variation of the external
> sorting algorithm used for sorting objects that don't fit in the
> memory. See:
>
> http://en.wikipedia.org/wiki/External_sorting
>
> for more details.
>
> Jim
>
> On Wed, Apr 28, 2010 at 1:16 PM, Dan Fundatureanu
> <[email protected]> wrote:
> > What is the algorithm used in "Shuffle and Sort" step?
> >
>

Reply via email to