At present, I am using some heuristics about the values of the row keys to guess what the start and end keys should be for slicing the table into finer splits. I wind up with splits of wildly different sizes: some complete their mapper task in 15 minutes while others take 10 hours.
Supposing I want to make 100 splits: I can get the start and end keys for the regions, but how would I go about slicing the table more finely, so that I get, say, 100 splits with about the same number of rows in each? I'm trying to get a set of mapper tasks that are all roughly balanced in how long they take to execute.

-geoff

-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of Stack
Sent: Friday, April 02, 2010 1:08 PM
To: hbase-user@hadoop.apache.org
Subject: Re: TableMapper and getSplits

Splitting a table on its regions makes the most sense when only one table is involved. For your case, just override the splitter and make different split objects.

As to the 'underloaded' HBase when there is one task per region, I'd say try it first. If there are many regions on the one regionserver, that could make for a decent load on the hosting regionserver.

Good luck,
St.Ack

On Fri, Apr 2, 2010 at 12:19 PM, Geoff Hendrey <ghend...@decarta.com> wrote:
> Hello,
>
> I have subclassed TableInputFormat and TableMapper. My job needs to
> read from two tables (one row from each) during its map method. The
> reduce method needs to write out to a table. For both the reads and
> the writes, I am using simple Gets and Puts, respectively, with autoflush true.
>
> One problem I see is that the number of map tasks I get with
> HBase is limited to the number of regions in the table. This seems to
> make the job slower than it would be if I had many more mappers. Could
> I improve the situation by overriding getSplits so that I could have
> many more mappers?
>
> I saw the following documented in TableMapReduceUtil: "Ensures that the
> given number of reduce tasks for the given job configuration does not
> exceed the number of regions for the given table." Is there some
> reason one would want to ensure that the number of tasks doesn't
> exceed the number of regions? It just seems to me that having one
> region serve only a single task would result in an underloaded HBase. Thoughts?
>
> -geoff
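
For reference, a minimal sketch of the kind of getSplits override discussed above: it takes the one-split-per-region list from TableInputFormat and subdivides each region's key range into a fixed number of sub-splits. It assumes the org.apache.hadoop.hbase.mapreduce API; the class name and the SUB_SPLITS_PER_REGION constant are illustrative, not anything from the thread. Note that Bytes.split divides the key space evenly, not the row count, so the sub-splits are only balanced if row keys are roughly uniform across each region's range.

    // A sketch, not production code: subdivide each per-region split into a
    // fixed number of smaller splits so more mapper tasks can run in parallel.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;

    public class SubdividingTableInputFormat extends TableInputFormat {

      private static final int SUB_SPLITS_PER_REGION = 4;  // tune to taste

      @Override
      public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> regionSplits = super.getSplits(context);  // one per region
        List<InputSplit> finerSplits = new ArrayList<InputSplit>();

        for (InputSplit s : regionSplits) {
          TableSplit ts = (TableSplit) s;
          byte[] start = ts.getStartRow();
          byte[] end = ts.getEndRow();

          // The first and last regions have empty start/end keys, which
          // Bytes.split cannot interpolate over; keep those splits whole.
          if (start.length == 0 || end.length == 0) {
            finerSplits.add(ts);
            continue;
          }

          // Bytes.split returns the boundary keys including start and end, so
          // N-1 interior cut points yield N sub-ranges.
          byte[][] cuts = Bytes.split(start, end, SUB_SPLITS_PER_REGION - 1);
          if (cuts == null) {
            // Some versions return null when the range cannot be subdivided;
            // fall back to the original per-region split.
            finerSplits.add(ts);
            continue;
          }
          for (int i = 0; i < cuts.length - 1; i++) {
            finerSplits.add(new TableSplit(ts.getTableName(), cuts[i],
                cuts[i + 1], ts.getRegionLocation()));
          }
        }
        return finerSplits;
      }
    }

To use it, you would point the job at the subclass after the usual TableMapReduceUtil setup, e.g. job.setInputFormatClass(SubdividingTableInputFormat.class).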
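
And a rough sketch of the two-table pattern from the quoted mail, where the mapper does a Get against a second table for each input row. The table name "second_table", the column names, the key derivation, and the output types are made up for illustration, and the exact HTable constructor and close() behavior vary a bit by client version.

    // A sketch of a TableMapper that reads one row from a side table per
    // input row and emits a joined value for the reducer.
    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    public class TwoTableMapper extends TableMapper<ImmutableBytesWritable, Text> {

      private HTable secondTable;

      @Override
      protected void setup(Context context) throws IOException {
        // Open the side table once per task rather than once per row.
        secondTable = new HTable(context.getConfiguration(), "second_table");
      }

      @Override
      protected void map(ImmutableBytesWritable rowKey, Result primaryRow,
          Context context) throws IOException, InterruptedException {
        // Fetch the matching row from the second table (here keyed identically).
        Result sideRow = secondTable.get(new Get(rowKey.get()));

        byte[] a = primaryRow.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"));
        byte[] b = sideRow.getValue(Bytes.toBytes("cf"), Bytes.toBytes("b"));
        context.write(rowKey,
            new Text(Bytes.toString(a) + "\t" + Bytes.toString(b)));
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        if (secondTable != null) {
          secondTable.close();
        }
      }
    }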