Re: getSplits() in TableInputFormatBase

john smith Sun, 11 Apr 2010 01:40:15 -0700

Amandeep,

Thanks for the explanation. What is the default value for the number of
map tasks? Is it not equal to the number of regions?


Right now I am running HBase in pseudo-distributed mode. If I set the
number of map tasks to 100000 (some big number), I get numSplits = 1.

If I don't set anything, numSplits = 2.


Can you explain this?
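
For reference, here is how I read the rule in the docs and code quoted
below, as a minimal Java sketch. SplitSketch, realNumSplits() and
regionsPerSplit() are names made up for illustration; they are not HBase
classes or methods. The even-grouping loop is only my assumption about
what "grouped the most evenly possible" means, not the actual
TableInputFormatBase code:

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of HBase.
class SplitSketch {

    // Same rule as the quoted line: never create more splits than regions.
    static int realNumSplits(int numSplits, int numRegions) {
        return Math.min(numSplits, numRegions);
    }

    // Assumed grouping: returns how many regions each split covers,
    // as evenly as possible, with the bigger splits placed first.
    static int[] regionsPerSplit(int numSplits, int numRegions) {
        int n = realNumSplits(numSplits, numRegions);
        if (n <= 0) {
            return new int[0];
        }
        int base = numRegions / n;   // regions every split gets
        int extra = numRegions % n;  // leftover regions, one each to the first splits
        int[] sizes = new int[n];
        for (int i = 0; i < n; i++) {
            sizes[i] = base + (i < extra ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(regionsPerSplit(100000, 1)));
        System.out.println(Arrays.toString(regionsPerSplit(3, 10)));
    }
}
```

With 1 region and numSplits = 100000 this prints [1]; with 10 regions
and numSplits = 3 it prints [4, 3, 3].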

Thanks
j.S

On Sun, Apr 11, 2010 at 1:50 PM, Amandeep Khurana <[email protected]> wrote:

> If you set the number of map tasks as a higher number than the number
> of regions (I generally set it to 100000 or something like that), the
> number of splits = number of regions. If you keep it lower, then it
> combines regions in a single split.
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Sun, Apr 11, 2010 at 1:15 AM, john smith <[email protected]>
> wrote:
>
> > Amandeep,
> >
> > I guess that is not true. See the explanation in the docs:
> >
> >
> > "Splits are created in number equal to the smallest between numSplits and
> > the number of HRegions in the table. If the number of splits is smaller
> > than the number of HRegions then splits are spanned across multiple
> > HRegions and are grouped the most evenly possible. In the case splits are
> > uneven the bigger splits are placed first in the InputSplit array."
> >
> > HRegion:
> > http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html
> >
> >
> > Depending on whether numSplits is smaller or larger than the number of
> > regions, it chooses the real number of splits, and the same is done in
> > the code:
> >
> > // Code
> > int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits;
> >
> > Here startKeys.length is the number of regions...
> >
> > Am I right?
> >
> > Thanks
> > j.S
> >
> >
> >
> > On Sun, Apr 11, 2010 at 1:33 PM, Amandeep Khurana <[email protected]>
> > wrote:
> >
> > > The number of splits is equal to the number of regions...
> > >
> > >
> > >
> > > On Sun, Apr 11, 2010 at 12:54 AM, john smith <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > In the method
> > > >
> > > >   public org.apache.hadoop.mapred.InputSplit[] getSplits(
> > > >       org.apache.hadoop.mapred.JobConf job, int numSplits)
> > > >
> > > > how is "numSplits" decided? I've seen different values of
> > > > numSplits for different MR jobs. Any reason for this?
> > > >
> > > > Also, what if I ignore numSplits and always split at region
> > > > boundaries? I guess that splitting at region boundaries makes more
> > > > sense and somewhat improves data locality.
> > > >
> > > > Any comments on the above statement?
> > > >
> > > > Thanks
> > > >
> > > > j.S
> > > >
> > >
> >
>
