Re: Using Hbase as data sink

stack Tue, 23 Dec 2008 08:05:42 -0800

Jim Twensky wrote:

...
Why do we need to set the number of the reduce tasks according to the number
of regions? Would it make a performance difference?

Regions are the 'natural' division in hbase. My guess is that the partitioner was an attempt at calculating an N for reducers that was other than 1 or just some hard-coding.

Another consideration is that at the reduce stage, keys are sorted, so inserts into hbase will be ordered. In this case, cutting the key space so it's divided at region boundaries could help distribute the upload and help performance. I'd imagine this would work best in a mature table, one that is already carrying a load, and where the upload is some smallish percentage of the total. Otherwise, region splitting would throw this partitioner calculation out of kilter.
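
A rough sketch of what "cutting the key space at region boundaries" could look like, in plain Java with no hbase dependency. This is not hbase's own partitioner. The class name RegionBoundarySketch, the method partitionFor, and the sample start keys are all made up for illustration, and it assumes the region start keys are already known and sorted, with an empty start key for the first region.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: maps a row key to a reducer by finding
// the region whose start key is the greatest one <= the row key.
public class RegionBoundarySketch {

    // Compare two keys byte by byte, treating bytes as unsigned,
    // which is the order row keys are sorted in.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // regionStartKeys is assumed sorted; the first entry is the empty key.
    // If there are fewer reducers than regions, regions wrap around.
    static int partitionFor(byte[] rowKey, List<byte[]> regionStartKeys, int numReducers) {
        int region = 0;
        for (int i = 0; i < regionStartKeys.size(); i++) {
            if (compareKeys(rowKey, regionStartKeys.get(i)) >= 0) {
                region = i;   // row key is at or past this region's start
            } else {
                break;        // start keys are sorted, so stop here
            }
        }
        return region % numReducers;
    }

    public static void main(String[] args) {
        // Three made-up regions: ["", "g"), ["g", "p"), ["p", end)
        List<byte[]> starts = Arrays.asList(
                new byte[0],
                "g".getBytes(StandardCharsets.UTF_8),
                "p".getBytes(StandardCharsets.UTF_8));
        for (String key : new String[] {"apple", "grape", "zebra"}) {
            int reducer = partitionFor(key.getBytes(StandardCharsets.UTF_8), starts, 3);
            System.out.println(key + " -> reducer " + reducer);
        }
    }
}
```

With one reducer per region, each reducer writes a sorted run of keys into a single region. If a region splits during the upload, the start-key list used here goes stale, which is the "out of kilter" case mentioned above.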

I am asking this because I didn't use it in my implementation. I configure
the table name and output formats inside the run method which looks like
this:

public int run(String[] args) throws Exception {
....

conf.setOutputKeyClass(ImmutableBytesWritable.class);

conf.setOutputValueClass(BatchUpdate.class);
conf.set("output.table.name",args[1]);
...

}

Notice that I don't have access to the partitioner unlike the
initTableReduceJob method. Is there a way to overcome this?

Pardon me, but which 'run' method? Why do you not have access? It's a public class? (Sorry if I'm missing something obvious -- still on first cup of coffee).


St.Ack
