I ran into an issue where a lot of data was passing from the mappers to a single
reducer. Enabling a combiner saved quite a bit of processing time by reducing
mapper disk writes and the amount of data moved to the reducer.
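
For reference, this is roughly how the combiner gets wired in. It's a sketch
assuming a word-count style job built on the stock library classes
(TokenCounterMapper, IntSumReducer), not my actual job:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

  public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "word count with combiner");
      job.setJarByClass(WordCountWithCombiner.class);
      job.setMapperClass(TokenCounterMapper.class);
      // The combiner runs on each map task's output before it is spilled
      // and shuffled, so partial sums travel to the reducer instead of
      // one record per word occurrence.
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Reusing the reducer as the combiner is only safe here because summing counts is
associative and commutative; the framework may run the combiner zero or more
times per map task.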

Nick Jones


----- Original Message -----
From: Cui tony <[email protected]>
To: [email protected] <[email protected]>
Sent: Tue Mar 30 21:24:18 2010
Subject: Re: question on shuffle and sort

Consider this extreme situation:
The input data is very large, and so is the map output. If 90% of the map
output records share the same key, then all of them will be sent to a single
reducer task node. So 90% of the reduce-phase work has to be done on one node
rather than across the cluster. That is very inefficient and does not scale.
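
As far as I understand, that happens because the default HashPartitioner routes
every record with the same key to the same reduce task, roughly like this (a
sketch of the idea, not the exact source):

  // Records are assigned to reduce partitions by hashing the key, so an
  // identical key always lands on the same reducer: one hot key means one
  // overloaded reduce task, no matter how many nodes the cluster has.
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }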


2010/3/31 Jones, Nick <[email protected]>

> Something to keep in mind though: the sort order depends on the key type.
> Text keys are sorted lexicographically, i.e. as character strings.
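> For instance, numeric values emitted as Text keys sort as strings, not as
> numbers (a toy illustration using Hadoop's Text and IntWritable types):
>
>   new Text("10").compareTo(new Text("9"));            // negative: "10" sorts before "9"
>   new IntWritable(10).compareTo(new IntWritable(9));   // positive: 10 sorts after 9
>
> If you need numeric ordering, use IntWritable/LongWritable keys rather than
> Text.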
>
> Nick Jones
>
>
> ----- Original Message -----
> From: Ed Mazur <[email protected]>
> To: [email protected] <[email protected]>
> Sent: Tue Mar 30 21:07:29 2010
> Subject: Re: question on shuffle and sort
>
> On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
> > Will all key-value pairs of the map output that have the same key be sent
> > to the same reducer task node?
>
> Yes, this is at the core of the MapReduce model. There is one call to
> the user's reduce function per unique map output key. This grouping is
> achieved by sorting, which means you see keys in increasing order.
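>
> A sketch of what that looks like from the reducer's side (a generic summing
> reducer, just to illustrate the grouping; not tied to any job in this thread):
>
>   import java.io.IOException;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapreduce.Reducer;
>
>   public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
>     @Override
>     protected void reduce(Text key, Iterable<IntWritable> values, Context context)
>         throws IOException, InterruptedException {
>       // Every value that shares this key arrives together in this one call,
>       // and reduce() is invoked in sorted key order.
>       int sum = 0;
>       for (IntWritable v : values) {
>         sum += v.get();
>       }
>       context.write(key, new IntWritable(sum));
>     }
>   }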
>
> Ed
>
>
>
