I ran into an issue where lots of data was passing from mappers to a single reducer. Enabling a combiner saved quite a bit of processing time by reducing mapper disk writes and data movement to the reducer.

Nick Jones
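[Editor's note: a rough sketch of the kind of setup Nick describes, assuming a word-count style job; the class names are illustrative and not taken from the thread. The reducer is reused as the combiner, so repeated keys are pre-aggregated on the map side before being spilled to disk and shuffled:]

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

  // Emits (word, 1) for every token in the input line.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums counts per word; usable as both combiner and reducer
  // because addition is associative and commutative.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(TokenMapper.class);
    // The combiner runs on map-side output before it is spilled and shuffled,
    // so repeated keys are collapsed locally instead of being sent in full.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

[This only works because summing is associative and commutative; a combiner whose logic does not satisfy those properties can change the job's results.]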
----- Original Message -----
From: Cui tony <[email protected]>
To: [email protected] <[email protected]>
Sent: Tue Mar 30 21:24:18 2010
Subject: Re: question on shuffle and sort

Consider this extreme situation: the input data is very large, and so is the map output. If 90% of the map output has the same key, then all of it will be sent to one reducer task node. So 90% of the reduce-phase work has to be done on a single node rather than across the cluster. That is very inefficient and does not scale.

2010/3/31 Jones, Nick <[email protected]>
> Something to keep in mind, though: sorting is appropriate to the key type.
> Text will be sorted lexicographically.
>
> Nick Jones
>
> ----- Original Message -----
> From: Ed Mazur <[email protected]>
> To: [email protected] <[email protected]>
> Sent: Tue Mar 30 21:07:29 2010
> Subject: Re: question on shuffle and sort
>
> On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
> > Will all key-value pairs of the map output that have the same key be
> > sent to the same reducer task node?
>
> Yes, this is at the core of the MapReduce model. There is one call to
> the user reduce function per unique map output key. This grouping is
> achieved by sorting, which means you see keys in increasing order.
>
> Ed
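[Editor's note: for context on why every record with the same key lands on one reduce task, as Ed describes, map output is routed by the partitioner. Hadoop's default HashPartitioner behaves essentially like the sketch below; the class name here is illustrative. With 90% of records sharing one key, 90% of the shuffle funnels into a single partition, which is exactly the skew Cui tony raises:]

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Roughly how the default hash partitioning routes map output: every
// record with the same key hashes to the same partition, and each
// partition is consumed by exactly one reduce task.
public class SketchHashPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the modulus is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

[A combiner, as mentioned at the top of the thread, shrinks how much of that hot-key data gets shuffled, but it does not change which reduce task the surviving records are sent to.]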
