Thanks for all your replies, guys. As Bharath asked, what happens when the number of reducers is greater than the number of distinct map output keys?
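
To make the question concrete, here is a minimal sketch of how keys would be spread over reducers, assuming the job uses Hadoop's default hash partitioning (hash of the key, sign bit masked off, modulo the number of reduce tasks). The class name, the sample row keys, and the use of String.hashCode() as a stand-in for the Writable key's hashCode() are my own assumptions for illustration, not code from anyone's job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; it does not depend on Hadoop itself.
class PartitionSketch {

    // Same arithmetic as the default hash partitioner: mask the sign bit,
    // then take the remainder modulo the number of reduce tasks.
    // Assumption: String.hashCode() stands in for the real key type's hashCode().
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Group the distinct keys by the reducer index they are routed to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (int r = 0; r < numReduceTasks; r++) {
            byReducer.put(r, new ArrayList<String>());
        }
        for (String key : keys) {
            byReducer.get(partitionFor(key, numReduceTasks)).add(key);
        }
        return byReducer;
    }

    // Count the reducers that were assigned no key at all.
    static int idleReducers(Map<Integer, List<String>> byReducer) {
        int idle = 0;
        for (List<String> keys : byReducer.values()) {
            if (keys.isEmpty()) {
                idle++;
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        // Three distinct map output keys, eight reduce tasks (made-up numbers).
        List<String> keys = Arrays.asList("row-a", "row-b", "row-c");
        Map<Integer, List<String>> byReducer = assign(keys, 8);
        for (Map.Entry<Integer, List<String>> e : byReducer.entrySet()) {
            System.out.println("reducer " + e.getKey() + " <- " + e.getValue());
        }
        System.out.println("idle reducers: " + idleReducers(byReducer));
    }
}
```

If this assumption holds, three distinct keys can occupy at most three of the eight reducers, so at least five reduce tasks would start, receive no input, and finish without doing useful work. The target reducer depends only on the key, not on which region server the values came from.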

On Fri, Aug 21, 2009 at 9:39 AM, bharath vissapragada <[email protected]> wrote:

> Aamandeep, Gray and Purtell, thanks for your replies. I have found them
> very useful.
>
> You said to increase the number of reduce tasks. Suppose the number of
> reduce tasks is more than the number of distinct map output keys. Would
> some of the reduce processes go to waste? Is that the case?
>
> Also, I have one more doubt. I have 5 values for a given key on one
> region and the other 2 values on 2 different region servers.
> Does Hadoop MapReduce take care of moving these 2 values to the region
> with the 5 values, instead of moving those 5 values to another system, to
> minimize the dataflow? Is this what is happening inside?
>
> On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell <[email protected]>
> wrote:
>
> > The behavior of TableInputFormat is to schedule one mapper for every
> > table region.
> >
> > In addition to what others have said already, if your reducer is doing
> > little more than storing data back into HBase (via TableOutputFormat),
> > then you can consider writing results back to HBase directly from the
> > mapper to avoid incurring the overhead of sort/shuffle/merge which
> > happens within the Hadoop job framework as map outputs are input into
> > reducers. For that type of use case -- using the Hadoop mapreduce
> > subsystem as essentially a grid scheduler -- something like
> > job.setNumReduceTasks(0) will do the trick.
> >
> > Best regards,
> >
> > - Andy
> >
> > ________________________________
> > From: john smith <[email protected]>
> > To: [email protected]
> > Sent: Friday, August 21, 2009 12:42:36 AM
> > Subject: Doubt in HBase
> >
> > Hi all,
> >
> > I have one small doubt. Kindly answer it even if it sounds silly.
> >
> > I am using MapReduce with HBase in distributed mode. I have a table
> > which spans 5 region servers. I am using TableInputFormat to read the
> > data from the table in the map.
> > When I run the program, how many map tasks are created by default? Is
> > it one per region server or more?
> >
> > Also, after the map task is over, the reduce task takes a bit more
> > time. Is it due to moving the map output across the region servers,
> > i.e., moving the values of the same key to a particular reducer before
> > it can start? Is there any way I can optimize the code (e.g. by storing
> > data for the same reducer nearby)?
> >
> > Thanks :)
