Hi Doopa,

In large multi-rack clusters, the network can become saturated in jobs like sort. Hadoop does a few things to try to ameliorate the issue:
- The reducers start copying map output data before the map tasks complete. Thus, the time spent copying is concurrent with the time spent processing map tasks.

- Intermediate data compression can (and usually should) be enabled. Because the keys are sorted coming out of the mapper, even simple compression algorithms often achieve pretty good compression ratios. This can reduce the amount of data transferred in the shuffle.

- A Combiner can be written which runs on the map side before data is shuffled. If you have many duplicate keys coming out of the mapper, this reduces the amount of data significantly. (There's a rough sketch of enabling both compression and a combiner below, after the quoted message.)

Additionally, there are many, many applications of Hadoop where the map output is much smaller than the map input. For most aggregations the mapper is just outputting sums/counts/etc., and with a combiner this is a fraction of the size of the original input.

Hope that helps
-Todd

On Sun, Dec 20, 2009 at 12:09 PM, doopha shaf <[email protected]> wrote:
>
> Trying to figure out how hadoop actually achieves its speed. Assuming that
> data locality is central to the efficiency of hadoop, how does the magic
> actually happen, given that data still gets moved all over the network to
> reach the reducers?
>
> For example, if I have 1gb of logs spread across 10 data nodes, and for the
> sake of argument, assume I use the identity mapper. Then 90% of data still
> needs to move across the network - how does the network not become
> saturated this way?
>
> What did I miss?...
> Thanks,
> D.S.
> --
> View this message in context:
> http://old.nabble.com/general-question---how-hadoop-works-tp26866842p26866842.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
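
P.S. Here's a rough, untested sketch of the second and third points above, written against the old 0.20 JobConf API as a word-count-style job. The class names, job name, and input/output paths are placeholders I made up for illustration, not anything specific to your job:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountWithCombiner {

  // Emits (word, 1) for every token in the input line.
  public static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Sums counts per word. Safe to use as both combiner and reducer
  // because addition is commutative and associative.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountWithCombiner.class);
    conf.setJobName("wordcount-with-combiner");

    // Compress the intermediate (map output) data before it is shuffled.
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);

    conf.setMapperClass(TokenMapper.class);
    // The combiner pre-aggregates duplicate keys on the map side,
    // so far less data crosses the network to the reducers.
    conf.setCombinerClass(SumReducer.class);
    conf.setReducerClass(SumReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Without the combiner, every (word, 1) pair the mapper emits would cross the network; with it, each map task ships at most one record per distinct word, and gzip on the sorted map output usually shrinks that further.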
