Re: getSplits() in TableInputFormatBase

john smith Sun, 11 Apr 2010 01:16:24 -0700

Amandeep,

I don't think that is quite right. See the explanation in the docs:

"Splits are created in number equal to the smallest between numSplits and
the number of
HRegion<http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/regionserver/HRegion.html>s
in the table. If the number of splits is smaller than the number of HRegions
then splits are spanned across multiple HRegions and are grouped the most
evenly possible. In the case splits are uneven the bigger splits are placed
first in the InputSplit array."

So, depending on whether numSplits is smaller or larger than the number of
regions, it chooses the smaller of the two as the real number of splits.
The code does the same:

// Code
int realNumSplits = numSplits > startKeys.length ? startKeys.length : numSplits;

Here startKeys.length is the number of regions...
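
To make the arithmetic concrete, here is a rough standalone sketch of how I
read that description. It is not the HBase source: the class SplitSketch and
the method regionsPerSplit are hypothetical names of mine, and the grouping
loop is only my interpretation of "grouped the most evenly possible" with the
bigger splits placed first.

```java
import java.util.Arrays;

// Hypothetical helper, not part of HBase: it only mirrors the arithmetic
// described in the docs quoted above.
class SplitSketch {

    // Returns how many regions each split would cover, bigger splits first.
    static int[] regionsPerSplit(int numRegions, int numSplits) {
        if (numRegions <= 0 || numSplits <= 0) {
            return new int[0]; // nothing to split
        }
        // Same rule as the quoted line: never more splits than regions.
        int realNumSplits = Math.min(numSplits, numRegions);
        int base = numRegions / realNumSplits;      // regions every split gets
        int remainder = numRegions % realNumSplits; // leftover regions
        int[] sizes = new int[realNumSplits];
        for (int i = 0; i < realNumSplits; i++) {
            // The first 'remainder' splits take one extra region each.
            sizes[i] = i < remainder ? base + 1 : base;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(regionsPerSplit(10, 3))); // [4, 3, 3]
        System.out.println(Arrays.toString(regionsPerSplit(4, 10))); // [1, 1, 1, 1]
    }
}
```

So with 10 regions and numSplits = 3 there would be 3 splits spanning several
regions each, and with 4 regions and numSplits = 10 there would be one split
per region.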

Am I right?

Thanks
j.S

On Sun, Apr 11, 2010 at 1:33 PM, Amandeep Khurana <ama...@gmail.com> wrote:

> The number of splits is equal to the number of regions...
>
>
>
> On Sun, Apr 11, 2010 at 12:54 AM, john smith <js1987.sm...@gmail.com>
> wrote:
>
> > Hi,
> >
> > In the method "public org.apache.hadoop.mapred.InputSplit[]
> > getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)"
> >
> > how is "numSplits" decided? I've seen different values of
> > numSplits for different MR jobs. Is there a reason for this?
> >
> > Also, what if I ignore numSplits and always split at region
> > boundaries? I guess that splitting at region boundaries makes more
> > sense and somewhat improves data locality.
> >
> > Any comments on the above statement?
> >
> > Thanks
> >
> > j.S
> >
>
