Jim Twensky wrote:
...
Why do we need to set the number of the reduce tasks according to the number
of regions? Would it make a performance difference?

Regions are the 'natural' division in hbase. My guess is that the partitioner was an attempt at calculating an N for reducers that was something other than 1 or a hard-coded value.

Another consideration is that at the reduce stage keys are sorted, so inserts into hbase will be ordered. In this case, cutting the key space so it's divided at region boundaries could help distribute the upload and help performance. I'd imagine this would work best in a mature table, one that is already carrying a load, and where the upload is some smallish percentage of the total. Otherwise, region splitting would throw this partitioner calculation out of kilter.
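To make the idea of cutting the key space at region boundaries concrete, here is a minimal, dependency-free sketch. It is not the hbase partitioner itself: the class name RegionBoundaryPartitioner, its methods, and the modulo fallback for when there are fewer reducers than regions are all assumptions made for illustration. It only assumes that region start keys are available as a sorted array and that the first region starts at the empty key.

```java
// Hypothetical sketch, not the actual hbase partitioner.
// Maps a row key to the region whose start key is the greatest one <= key.
public class RegionBoundaryPartitioner {
    // Region start keys in sorted order; the first region starts at the empty key.
    private final byte[][] startKeys;

    public RegionBoundaryPartitioner(byte[][] sortedStartKeys) {
        this.startKeys = sortedStartKeys;
    }

    // Lexicographic comparison treating bytes as unsigned values.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Binary search for the last start key that is <= key.
    public int regionIndex(byte[] key) {
        int lo = 0;
        int hi = startKeys.length - 1;
        int found = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(startKeys[mid], key) <= 0) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }

    // Assumption: with fewer reducers than regions, fold region indexes with modulo.
    public int getPartition(byte[] key, int numReduceTasks) {
        return regionIndex(key) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Three made-up regions: ["", "g"), ["g", "p"), ["p", end).
        byte[][] starts = { {}, {'g'}, {'p'} };
        RegionBoundaryPartitioner p = new RegionBoundaryPartitioner(starts);
        System.out.println(p.getPartition(new byte[] {'h'}, 3));
    }
}
```

With one reducer per region, each reducer writes a sorted run of keys into a single region. If regions split during the upload, the start keys used here go stale, which is the 'out of kilter' case mentioned above.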

I am asking this because I didn't use it in my implementation. I configure
the table name and output formats inside the run method which looks like
this:

public int run(String[] args) throws Exception {
....
conf.setOutputKeyClass(ImmutableBytesWritable.class);
conf.setOutputValueClass(BatchUpdate.class);
conf.set("output.table.name",args[1]);
...

}

Notice that I don't have access to the partitioner unlike the
initTableReduceJob method. Is there a way to overcome this?

Pardon me, but which 'run' method? Why do you not have access? It's a public class? (Sorry if I'm missing the obvious -- still on my first cup of coffee).

St.Ack
