Hi, Jones

Since you have already run into the situation I was worried about, I have my answer now. Redesigning the map function or adding a combiner may be the only way to deal with this kind of input data.
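
To make the combiner effect concrete, here is a minimal sketch in plain Python. It is not Hadoop API code: the function names, the word-count job, and the toy partitioner are all assumptions chosen for illustration. It simulates three mappers whose output is 90% one key and compares how many records reach each reducer with and without a mapper-side combine step.

```python
from collections import Counter, defaultdict


def map_words(record):
    """Hypothetical map function: emit (word, 1) for every word in a record."""
    return [(word, 1) for word in record.split()]


def combine(pairs):
    """Combiner: sum the values per key on the mapper side, before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle(mapper_outputs, num_reducers):
    """Send every pair to a reducer chosen from its key (toy partitioner)."""
    partitions = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            # Deterministic stand-in for a hash partitioner (an assumption).
            index = sum(key.encode("utf-8")) % num_reducers
            partitions[index].append((key, value))
    return partitions


def reduce_counts(partition):
    """Reducer: sort by key, then sum the values of each key group."""
    totals = defaultdict(int)
    for key, value in sorted(partition):
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Three input splits; 90% of the map output has the key "a".
    splits = ["a a a a a a a a a b"] * 3
    raw = [map_words(split) for split in splits]
    combined = [combine(pairs) for pairs in raw]

    # Records received by each reducer without and with the combiner.
    print([len(p) for p in shuffle(raw, 2)])       # [3, 27]
    print([len(p) for p in shuffle(combined, 2)])  # [3, 3]
```

Note that the combiner does not remove the skew itself: every "a" record still goes to the same reducer, but that reducer receives one partial sum per mapper instead of one record per occurrence. This only works when the reduce operation is associative and commutative, as summing is.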

2010/3/31 Jones, Nick <[email protected]>

> I ran into an issue where lots of data was passing from mappers to a single
> reducer. Enabling a combiner saved quite a bit of processing time by
> reducing mapper disk writes and data movements to the reducer.
>
> Nick Jones
>
>
> ----- Original Message -----
> From: Cui tony <[email protected]>
> To: [email protected] <[email protected]>
> Sent: Tue Mar 30 21:24:18 2010
> Subject: Re: question on shuffle and sort
>
> Consider this extreme situation:
> The input data is very large, and also the map result. 90% of map result
> have the same key, then all of them will be sent to one reducer tasknode.
> So 90% of work of reduce phase have to been done on a single node, not the
> cluster. That is very ineffective and less scalable.
>
>
> 2010/3/31 Jones, Nick <[email protected]>
>
> > Something to keep in mind though, sorting is appropriate to the key type.
> > Text will be sorted lexicographically.
> >
> > Nick Jones
> >
> >
> > ----- Original Message -----
> > From: Ed Mazur <[email protected]>
> > To: [email protected] <[email protected]>
> > Sent: Tue Mar 30 21:07:29 2010
> > Subject: Re: question on shuffle and sort
> >
> > On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
> > > Did all key-value pairs of the map output, which have the same key,
> > > will be sent to the same reducer tasknode?
> >
> > Yes, this is at the core of the MapReduce model. There is one call to
> > the user reduce function per unique map output key. This grouping is
> > achieved by sorting which means you see keys in increasing order.
> >
> > Ed
