Aamandeep, Gray and Purtell, thanks for your replies. I have found them very useful.
You said to increase the number of reduce tasks. Suppose the number of reduce tasks is greater than the number of distinct map output keys. Would some of the reduce tasks then go to waste?

I also have one more doubt. I have 5 values for a given key on one region, and 2 other values for the same key on 2 different region servers. Does Hadoop MapReduce take care of moving those 2 values to the region holding the 5 values, instead of moving the 5 values to another machine, so as to minimize the data flow? Is this what happens internally?

On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell <[email protected]> wrote:

> The behavior of TableInputFormat is to schedule one mapper for every table
> region.
>
> In addition to what others have said already, if your reducer is doing
> little more than storing data back into HBase (via TableOutputFormat), then
> you can consider writing results back to HBase directly from the mapper to
> avoid incurring the overhead of sort/shuffle/merge which happens within the
> Hadoop job framework as map outputs are input into reducers. For that type
> of use case -- using the Hadoop mapreduce subsystem as essentially a grid
> scheduler -- something like job.setNumReducers(0) will do the trick.
>
> Best regards,
>
>    - Andy
>
> ________________________________
> From: john smith <[email protected]>
> To: [email protected]
> Sent: Friday, August 21, 2009 12:42:36 AM
> Subject: Doubt in HBase
>
> Hi all,
>
> I have one small doubt. Kindly answer it even if it sounds silly.
>
> I am using Map Reduce in HBase in distributed mode. I have a table which
> spans across 5 region servers. I am using TableInputFormat to read the data
> from the tables in the map. When I run the program, by default how many
> map regions are created? Is it one per region server or more?
>
> Also, after the map task is over, the reduce task is taking a bit more time.
> Is it due to moving the map output across the regionservers, i.e. moving the
> values of the same key to a particular reduce phase to start the reducer? Is
> there any way I can optimize the code (e.g. by storing data of the same
> reducer nearby)?
>
> Thanks :)
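
To make the first question above concrete (more reduce tasks than distinct map output keys), here is a minimal sketch in plain Java with no Hadoop dependency. It assumes the job uses hash partitioning with the same arithmetic as Hadoop's default HashPartitioner. The class name, the helper methods and the sample keys are hypothetical, and plain String keys stand in for the Writable key types a real job would use, so the actual reducer numbers in a real job would differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner: clear the sign bit
    // of the key's hash, then take it modulo the number of reduce tasks.
    // Assumption: String.hashCode() stands in for the hash of a Writable key.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Returns the indices of the reduce tasks that no key is assigned to.
    static List<Integer> idleReducers(List<String> distinctKeys, int numReduceTasks) {
        Set<Integer> used = new TreeSet<>();
        for (String key : distinctKeys) {
            used.add(partitionFor(key, numReduceTasks));
        }
        List<Integer> idle = new ArrayList<>();
        for (int r = 0; r < numReduceTasks; r++) {
            if (!used.contains(r)) {
                idle.add(r);
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        // Hypothetical data: 3 distinct map output keys, 8 reduce tasks.
        List<String> keys = Arrays.asList("row-a", "row-b", "row-c");
        int numReduceTasks = 8;
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partitionFor(key, numReduceTasks));
        }
        System.out.println("idle reducers: " + idleReducers(keys, numReduceTasks));
    }
}
```

With 3 distinct keys and 8 reduce tasks, at most 3 reducers can receive data, so at least 5 are started only to find no input. Note also that the partition number depends only on the key and the number of reduce tasks, not on which machine holds most of the values for that key.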
