> From: stack <[email protected]>
> Subject: Re: Using Hbase as data sink
> To: [email protected]
> Date: Tuesday, December 23, 2008, 8:05 AM
> Jim Twensky wrote:
> > ...
> > Why do we need to set the number of the reduce tasks
> > according to the number of regions? Would it make a
> > performance difference?
> >   
> 
> Regions are the 'natural' division in hbase.  My
> guess is that the partitioner was an attempt at calculating
> an N for reducers that was other than 1 or just some
> hard-coding.
 
I use the natural log of the number of regions, rounded up, as
the number of reduce tasks to run, together with the default
partitioner, which distributes the load evenly among the
reducers using a hash of the key. More precisely:

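  HTable table = new HTable(conf, tableName);
  // getStartKeys() returns one start key per region, so the
  // array length is the region count.
  int nrReducers =
   (int)Math.ceil(
     Math.log1p((double)table.getStartKeys().length));
  // log1p(n) is ln(1 + n), so a single region still gets
  // one reducer.
  // ...
  job.setNumReduceTasks(nrReducers);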

I have no formal reason for this. The method just has a nice
feel to it.
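
To give a rough sense of how slowly this grows, here is a small
standalone sketch of the same arithmetic with no HBase
dependency. The class name ReducerCount, the helper reducersFor
and the sample region counts are made up for illustration; only
the ceil(log1p(n)) expression comes from the snippet above.

```java
public class ReducerCount {
    // Hypothetical helper: same expression as above, ceil(ln(1 + regionCount)).
    static int reducersFor(int regionCount) {
        return (int) Math.ceil(Math.log1p((double) regionCount));
    }

    public static void main(String[] args) {
        // Illustrative region counts, not measurements from a real table.
        int[] regionCounts = {1, 10, 100, 1000};
        for (int n : regionCounts) {
            System.out.println(n + " regions -> " + reducersFor(n) + " reducers");
        }
    }
}
```

With these inputs it should come out to 1, 3, 5 and 7 reducers,
so even a table with a thousand regions gets only a handful of
reduce tasks.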

   - Andy



      
