Thanks for all your replies, guys. As Bharath said, what happens when the
number of reducers exceeds the number of distinct map output keys?
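
To make the question concrete, here is a minimal sketch of how keys would be
spread over reducers. It assumes the job uses Hadoop's default hash
partitioning, which assigns a key to (hashCode & Integer.MAX_VALUE) modulo the
number of reduce tasks. The class name, the method names and the sample keys
are made up for illustration, and String.hashCode() stands in for the hash of
the real key type (for example Text or ImmutableBytesWritable), so the exact
partition numbers will differ in a real job.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; it does not use any Hadoop classes.
public class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner, applied to a String
    // key (assumption: the real job would hash its own Writable key type).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Collects the reducer indexes that receive at least one key.
    static Set<Integer> busyReducers(String[] keys, int numReduceTasks) {
        Set<Integer> busy = new TreeSet<>();
        for (String key : keys) {
            busy.add(partitionFor(key, numReduceTasks));
        }
        return busy;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c"};   // 3 distinct map output keys
        int numReduceTasks = 8;            // more reducers than keys

        Set<Integer> busy = busyReducers(keys, numReduceTasks);
        System.out.println("reducers with input: " + busy);
        System.out.println("idle reducers: " + (numReduceTasks - busy.size()));
    }
}
```

With these sample keys, at most 3 of the 8 reducers can receive any input, so
the remaining ones would have nothing to reduce. As far as I understand, such
reducers are still scheduled and simply finish with empty output.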

On Fri, Aug 21, 2009 at 9:39 AM, bharath vissapragada <
[email protected]> wrote:

> Aamandeep, Gray and Purtell, thanks for your replies. I have found them
> very useful.
>
> You said to increase the number of reduce tasks. Suppose the number of
> reduce tasks is more than the number of distinct map output keys. Would some
> of the reduce tasks then go to waste?
>
> Also, I have one more question. I have 5 values for a given key on one
> region and 2 other values on 2 different region servers. Does Hadoop
> MapReduce take care of moving these 2 values to the region with 5 values,
> instead of moving those 5 values to another system, to minimize the data
> flow? Is this what happens internally?
>
> On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell <[email protected]>
> wrote:
>
> > The behavior of TableInputFormat is to schedule one mapper for every table
> > region.
> >
> > In addition to what others have said already, if your reducer is doing
> > little more than storing data back into HBase (via TableOutputFormat), then
> > you can consider writing results back to HBase directly from the mapper to
> > avoid incurring the overhead of sort/shuffle/merge which happens within the
> > Hadoop job framework as map outputs are input into reducers. For that type
> > of use case -- using the Hadoop mapreduce subsystem as essentially a grid
> > scheduler -- something like job.setNumReduceTasks(0) will do the trick.
> >
> > Best regards,
> >
> >   - Andy
> >
> >
> >
> >
> > ________________________________
> > From: john smith <[email protected]>
> > To: [email protected]
> > Sent: Friday, August 21, 2009 12:42:36 AM
> > Subject: Doubt in HBase
> >
> > Hi all,
> >
> > I have one small question. Kindly answer it even if it sounds silly.
> >
> > I am using MapReduce with HBase in distributed mode. I have a table which
> > spans 5 region servers. I am using TableInputFormat to read the data
> > from the table in the map phase. When I run the program, how many map
> > tasks are created by default? Is it one per region server, or more?
> >
> > Also, after the map tasks finish, the reduce task takes a bit more time.
> > Is it due to moving the map output across the region servers, i.e., moving
> > the values for the same key to a particular reducer before it can start? Is
> > there any way I can optimize the code (e.g. by storing the data for the same
> > reducer nearby)?
> >
> > Thanks :)
> >
> >
> >
> >
>
