At present I am using some heuristics about the values of row keys to
guess what the start and end keys might be for slicing the splits
finely. I wind up with splits of wildly different sizes: some complete
their mapper task in 15 minutes while others take 10 hours.

Suppose I want to make 100 splits. I can get the start and end keys for
the regions, but how would I go about slicing the table more finely, so
that I get, say, 100 splits, all with about the same number of rows in
each? I'm trying to get a bunch of mapper tasks that are all roughly
balanced in how long they take to execute.

-geoff 

-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of
Stack
Sent: Friday, April 02, 2010 1:08 PM
To: hbase-user@hadoop.apache.org
Subject: Re: TableMapper and getSplits

Splitting a table on its Regions makes most sense when only one table is
involved.  For your case, just override the splitter and make different
split objects.
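
In practice, overriding the splitter means subclassing TableInputFormat
(along the lines of the sketch further up the thread) and telling the
job to use it. A rough sketch of the driver-side wiring, assuming the
hypothetical FineGrainedTableInputFormat and a mapper class invented
here as MyMapper; the output key/value classes are just examples.
TableMapReduceUtil installs the stock TableInputFormat, so the custom
class is set afterwards.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class JobDriver {
  public static Job createJob(Configuration conf) throws Exception {
    Job job = new Job(conf, "two-table map");

    Scan scan = new Scan();
    scan.setCaching(500); // fewer RPC round trips per mapper

    // Installs the scan, the mapper, and the default input format.
    TableMapReduceUtil.initTableMapperJob("primary_table", scan,
        MyMapper.class, ImmutableBytesWritable.class, Put.class, job);

    // Swap in the finer-grained splitter after the default is set.
    job.setInputFormatClass(FineGrainedTableInputFormat.class);
    return job;
  }
}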

As to the 'underloaded' HBase when there is one task per region, I'd say
try it first.  If there are many regions on the one regionserver, that
could make for a decent load on the hosting regionserver.

Good luck,
St.Ack

On Fri, Apr 2, 2010 at 12:19 PM, Geoff Hendrey <ghend...@decarta.com>
wrote:
> Hello,
>
> I have subclassed TableInputFormat and TableMapper. My job needs to
> read from two tables (one row from each) during its map method. The
> reduce method needs to write out to a table. For both the reads and
> the writes, I am using simple Get and Put respectively, with autoflush
> set to true.
>
> One problem I see is that the number of map tasks that I get with
> HBase is limited to the number of regions in the table. This seems to
> make the job slower than it would be if I had many more mappers. Could
> I improve the situation by overriding getSplits so that I could have
> many more mappers?
>
> I saw the following doc'd in TableMapReduceUtil: "Ensures that the
> given number of reduce tasks for the given job configuration does not
> exceed the number of regions for the given table." Is there some
> reason one would want to ensure that the number of tasks doesn't
> exceed the number of regions? It just seems to me that having one
> region serve only a single task would result in an underloaded HBase.
> Thoughts?
>
> -geoff
>
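
For what it's worth, the setup described in the quoted message (a Get
against a second table from inside map, with the reduce side writing
Puts) might look roughly like the sketch below. The table, column
family, and qualifier names are invented for illustration, and it is
written loosely against the classic HTable client API, whose
constructors and lifecycle methods vary between HBase versions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoTableMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private HTable secondTable;

  @Override
  protected void setup(Context context) throws IOException {
    // Open the second table once per task rather than once per row.
    Configuration conf = context.getConfiguration();
    secondTable = new HTable(conf, "second_table");
  }

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result primaryRow,
      Context context) throws IOException, InterruptedException {
    // Fetch the matching row from the second table by the same key.
    Result secondRow = secondTable.get(new Get(rowKey.get()));
    byte[] value = secondRow.getValue(Bytes.toBytes("cf"),
        Bytes.toBytes("value"));
    if (value == null) {
      return; // no matching row in the second table
    }
    // Emit a Put that the reduce side can write to the output table.
    Put put = new Put(rowKey.get());
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("joined"), value);
    context.write(rowKey, put);
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (secondTable != null) {
      secondTable.close();
    }
  }
}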
