Thank you, and I think this might also hold the answer to my question. In "hadoop-default.xml":
<property>
  <name>map.sort.class</name>
  <value>org.apache.hadoop.util.QuickSort</value>
  <description>The default sort class for sorting keys.</description>
</property>

On Wed, Apr 28, 2010 at 3:27 PM, Patrick Angeles <[email protected]> wrote:

> Dan,
>
> Shuffle and Sort is a combination of multiple 'algorithms':
>
> - Map output goes to a circular, in-memory buffer.
> - When this starts filling up, it gets 'spilled' to disk.
> - Spilling involves writing each K/V pair to a partition-specific file
>   (where the partition is computed by the algorithm Jim describes below),
>   in order sorted by K.
> - You may get multiple files per partition (you get a file per partition
>   every time a spill happens).
> - In which case, the spill files get merge-sorted into larger files.
> - Reducers pick up the final merged files from multiple mappers.
> - Reducers may pick up multiple of these files from several mappers.
> - The reducer may perform a final merge before starting the reduce phase.
>
> On Wed, Apr 28, 2010 at 2:47 PM, Jim Twensky <[email protected]> wrote:
>
> > I'm not quite sure what you mean by the Shuffle algorithm, but briefly:
> > for each (key, value) pair, a hash value is computed and then the
> > record is sent to a node by taking the modulo of that hash. So if you
> > have n reducers, the record goes to reducer #: hash(key) % n.
> >
> > The sorting algorithm is most probably a variation of the external
> > sorting algorithm used for sorting objects that don't fit in memory.
> > See:
> >
> > http://en.wikipedia.org/wiki/External_sorting
> >
> > for more details.
> >
> > Jim
> >
> > On Wed, Apr 28, 2010 at 1:16 PM, Dan Fundatureanu
> > <[email protected]> wrote:
> > > What is the algorithm used in the "Shuffle and Sort" step?
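The hash(key) % n scheme Jim describes is essentially what Hadoop's default HashPartitioner does. A minimal standalone sketch (not the actual Hadoop class; the mask with Integer.MAX_VALUE clears the sign bit so the modulo result is never negative, since hashCode() can be negative in Java):

```java
// Sketch of hash(key) % n partitioning, in the spirit of Hadoop's
// default HashPartitioner. Masking with Integer.MAX_VALUE clears the
// sign bit so the result of the modulo is always in [0, numReduceTasks).
public class SimpleHashPartitioner {
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int n = 4; // number of reducers, a made-up value for illustration
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer " + getPartition(key, n));
        }
    }
}
```

Note that all values sharing a key hash to the same partition, which is what guarantees a given key is processed by exactly one reducer.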

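To make the external-sort idea concrete, here is a toy k-way merge in Java. This is only a sketch of the merge step Patrick mentions: real spill files are sorted runs on disk, whereas these "runs" are in-memory lists, and real Hadoop merges serialized K/V records, not integers.

```java
import java.util.*;

// Toy k-way merge over already-sorted "runs" (standing in for sorted
// spill files on disk). A priority queue always yields the smallest
// current head element across all runs, producing one sorted stream.
public class KWayMerge {
    public static List<Integer> merge(List<List<Integer>> runs) {
        // Each queue entry holds {value, runIndex, offsetWithinRun}.
        PriorityQueue<int[]> pq =
            new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                pq.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();
            out.add(top[0]);
            int r = top[1], next = top[2] + 1;
            // Advance within the run the element came from.
            if (next < runs.get(r).size()) {
                pq.add(new int[] {runs.get(r).get(next), r, next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 9),
            Arrays.asList(2, 3, 8),
            Arrays.asList(5, 6, 7));
        System.out.println(merge(runs)); // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

The key property is that only the head of each run needs to be in memory at once, which is why this works for data far larger than RAM.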