> From: stack <[email protected]>
> Subject: Re: Using Hbase as data sink
> To: [email protected]
> Date: Tuesday, December 23, 2008, 8:05 AM
> Jim Twensky wrote:
> > ...
> > Why do we need to set the number of the reduce tasks
> > according to the number of regions? Would it make a
> > performance difference?
> >
>
> Regions are the 'natural' division in hbase. My
> guess is that the partitioner was an attempt at calculating
> an N for reducers that was other than 1 or just some
> hard-coding.
I use the natural log of the region count (plus one, rounded
up) as the number of reduce tasks to run, together with the
default partitioner, which distributes the load evenly among
the reducers using a hash of the key. More precisely:

    HTable table = new HTable(conf, tableName);
    // getStartKeys() returns one start key per region, so its
    // length is the region count.
    int nrReducers =
      (int)Math.ceil(
        Math.log1p((double)table.getStartKeys().length));
    // ...
    job.setNumReduceTasks(nrReducers);

I have no formal reason for this. The method just has a nice
feel to it.
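
To give a sense of scale, here is a minimal standalone sketch of the
same arithmetic with no HBase dependency. The class name ReducerCount
and the method forRegions are made up for illustration; only
Math.ceil and Math.log1p come from the snippet above.

```java
// Illustrative sketch only: ReducerCount and forRegions are hypothetical
// names, not part of HBase or Hadoop.
final class ReducerCount {

    // Same arithmetic as the snippet above: ceil(ln(1 + regionCount)).
    static int forRegions(int regionCount) {
        return (int) Math.ceil(Math.log1p((double) regionCount));
    }

    public static void main(String[] args) {
        int[] regionCounts = {1, 2, 10, 100, 1000};
        for (int regions : regionCounts) {
            System.out.println(
                regions + " regions -> " + forRegions(regions) + " reducers");
        }
    }
}
```

The count grows slowly: 1, 2, 10, 100 and 1000 regions give 1, 2, 3, 5
and 7 reducers respectively.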
- Andy