I agree - you want a 1:1 mapping of tasks to regions. If you only
have 22 regions, then I would wonder whether you even need HBase to
host that data.
If the rows in a region are REALLY small, and the amount of data
you'll have is bounded pretty low, then maybe you should consider
changing the configuration for region size to be something less than
256MB. Right now this is something that's configured at the server
level, but ideally it'll eventually be configurable on a table-by-
table basis.
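For what it's worth, the server-level setting in question is
hbase.hregion.max.filesize (mentioned later in this thread). A sketch
of what the hbase-site.xml entry might look like, with 64MB chosen
purely as an illustrative value below the 256MB default:

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 67108864 bytes = 64MB; illustrative, not a recommendation -->
  <value>67108864</value>
  <description>Maximum store file size before a region is split.</description>
</property>
```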
-Bryan
On May 8, 2008, at 1:43 PM, Andrew Purtell wrote:
I have also been considering this issue but from the opposite direction
-- forcing splits of the table. From the perspective of I/O and loading
optimization, doesn't it make the most sense to have a 1:1 mapping of
regions to tasks?
I think this issue will come up now and again if a user has tables that
hold a large number of items, yet those items have small keys and
column data.
Of course there are some problems with this, first and foremost that
too many regions for the carrying capacity of the cluster will take
down all of the region servers via OOME in a cascading spiral of
death. Then there is the issue of making sure the key space of a
forced split is not so small as to underutilize the available mapfile
storage. Then there is the issue of key distributions being dependent
on the particular dataset and schema, so tweaking the global setting
hbase.hregion.max.filesize might not be a good idea.
Just some random thoughts on the topic,
- Andrew Purtell
--- "Jonathan M. Kupferman" <[EMAIL PROTECTED]> wrote:
Hi Everyone,
I am currently attempting to run a MapReduce job where the input
comes from HBase. The input table has 22 regions, and thus creates 22
map tasks. This creates an issue, however, since so few map tasks
result in a poor distribution of labor on a cluster of 10+ machines,
especially since the amount of work required varies widely from
region to region.
I would like to increase the number of map tasks at least 2-fold; the
relevant code seems to be in TableInputFormat.
//Original code
Text[] startKeys = m_table.getStartKeys();
if (startKeys == null || startKeys.length == 0) {
    throw new IOException("Expecting at least one region");
}
InputSplit[] splits = new InputSplit[startKeys.length];
for (int i = 0; i < startKeys.length; i++) {
    splits[i] = new TableSplit(m_tableName, startKeys[i],
        ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text());
}
//end-original
//Modified code
Text[] startKeys = m_table.getStartKeys();
if (startKeys == null || startKeys.length == 0) {
    throw new IOException("Expecting at least one region");
}
List<InputSplit> splits = new ArrayList<InputSplit>();
for (int i = 0; i < startKeys.length; i++) {
    // end key of this region (empty Text for the last region)
    Text endKey = ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text();
    if (endKey.getLength() == 0) {
        // the last region has no end key to bisect, so leave it whole
        splits.add(new TableSplit(m_tableName, startKeys[i], endKey));
    } else {
        // midpoint of the region's key range -- assumes numeric row keys;
        // the first region's start key is empty, so treat it as 0
        int start = (startKeys[i].getLength() == 0)
            ? 0 : Integer.parseInt(startKeys[i].toString());
        int half = (start + Integer.parseInt(endKey.toString())) / 2;
        Text halfsplit = new Text(Integer.toString(half));
        splits.add(new TableSplit(m_tableName, startKeys[i], halfsplit));
        splits.add(new TableSplit(m_tableName, halfsplit, endKey));
    }
}
//end-modified
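The bisection idea above can also be sketched standalone, without the
HBase classes, to check the range arithmetic. RegionBisector, midpoint,
and bisect are hypothetical names, and decimal numeric row keys are
assumed, as in the modified code:

```java
import java.util.ArrayList;
import java.util.List;

public class RegionBisector {

    // Midpoint of a numeric key range [startKey, endKey).
    static String midpoint(String startKey, String endKey) {
        long start = startKey.isEmpty() ? 0 : Long.parseLong(startKey);
        long end = Long.parseLong(endKey);
        return Long.toString((start + end) / 2);
    }

    // Turn each region's [start, end) range into two sub-ranges.
    // lastEndKey stands in for the table's final end key.
    static List<String[]> bisect(String[] startKeys, String lastEndKey) {
        List<String[]> ranges = new ArrayList<String[]>();
        for (int i = 0; i < startKeys.length; i++) {
            String end = (i + 1 < startKeys.length) ? startKeys[i + 1] : lastEndKey;
            String half = midpoint(startKeys[i], end);
            ranges.add(new String[] { startKeys[i], half });
            ranges.add(new String[] { half, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Two regions, keys 0..100 and 100..200, become four sub-ranges.
        for (String[] r : bisect(new String[] { "0", "100" }, "200")) {
            System.out.println(r[0] + " - " + r[1]);
        }
    }
}
```

Each region yields exactly two sub-ranges here, so the job would see
twice as many map tasks without any actual region split on the servers.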
It seems like the required modifications would be something along the
lines of the code written above. Is this the correct/best way to go
about this?
Thanks,
Jonathan