I agree - you want a 1:1 mapping of tasks to regions. If you only have 22 regions, then I would wonder whether you even need HBase to host that data.

If the rows in a region are REALLY small, and the amount of data you'll have is bounded pretty low, then maybe you should consider changing the configuration for region size to be something less than 256MB. Right now this is something that's configured at the server level, but ideally it'll eventually be configurable on a table-by-table basis.
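
For example, something like this in hbase-site.xml on the region servers (the value below is just an illustration -- pick whatever fits your data, and note the setting is in bytes):

  <property>
    <name>hbase.hregion.max.filesize</name>
    <!-- default is 256MB (268435456); a smaller value means more, smaller regions -->
    <value>67108864</value>
  </property>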

-Bryan

On May 8, 2008, at 1:43 PM, Andrew Purtell wrote:

I have also been considering this issue but from the opposite direction -- forcing splits of the table. From the perspective of I/O and loading
optimization, doesn't it make the most sense to have a 1:1 mapping of
regions to tasks?

I think this issue will come up now and again when a user has tables that hold a large number of items, yet those items have small keys and column data.

Of course there are some problems with this. First and foremost, too many regions for the carrying capacity of the cluster will take down all of the region servers via OOME in a cascading spiral of death. There is also the issue of making sure the key space of a forced split is not so small as to underutilize the available mapfile storage. Finally, key distributions depend on the particular dataset and schema, so tweaking the global setting hbase.hregion.max.filesize might not be a good idea.

Just some random thoughts on the topic,

   - Andrew Purtell

--- "Jonathan M. Kupferman" <[EMAIL PROTECTED]> wrote:

Hi Everyone,
I am currently attempting to run a MapReduce job where the input comes from HBase. The input table has 22 regions, and thus creates 22 map tasks. This, however, creates an issue, since so few map tasks result in a poor distribution of labor on a cluster of 10+ machines, especially since the amount of work required varies greatly from region to region.

I would like to increase the number of map tasks at least two-fold; the relevant code seems to be in TableInputFormat.

//Original code
      Text[] startKeys = m_table.getStartKeys();
      if(startKeys == null || startKeys.length == 0) {
        throw new IOException("Expecting at least one region");
      }
      InputSplit[] splits = new InputSplit[startKeys.length];
      for(int i = 0; i < startKeys.length; i++) {
        splits[i] = new TableSplit(m_tableName, startKeys[i],
            ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text());
      }
//end-original

//Modified code
      Text[] startKeys = m_table.getStartKeys();
      if(startKeys == null || startKeys.length == 0) {
        throw new IOException("Expecting at least one region");
      }
      // Two splits per region instead of one.
      InputSplit[] splits = new InputSplit[startKeys.length * 2];
      for(int i = 0; i < startKeys.length; i++) {
        Text endKey = ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text();
        // Midpoint of the region, assuming the row keys are plain integers.
        // The last region has no known end key, so it stays one split
        // (plus an empty one).
        Text halfsplit = (endKey.getLength() == 0)
            ? startKeys[i]
            : new Text(Integer.toString(
                (Integer.parseInt(startKeys[i].toString())
                    + Integer.parseInt(endKey.toString())) / 2));
        splits[2 * i] = new TableSplit(m_tableName, startKeys[i], halfsplit);
        splits[2 * i + 1] = new TableSplit(m_tableName, halfsplit, endKey);
      }
//end-modified

It seems like the required modifications would be something along the lines of the code written above. Is this the correct/best way to go about this?
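
One more caveat: the integer parsing above only works if the row keys really are plain integers, and even then only if their string ordering matches the numeric ordering (e.g. zero-padded keys). If that does not hold, I suppose the midpoint could be computed on the raw key bytes instead -- a rough, untested sketch:

// Rough sketch: lexicographic midpoint of two row keys, computed on the raw
// bytes so the keys do not have to be integers. Both keys are treated as
// big-endian numbers, left-aligned and zero-padded to the same length.
// Only meaningful when the end key is known and sorts after the start key.
public static byte[] midpoint(byte[] start, byte[] end) {
  int len = Math.max(start.length, end.length);
  // Add the two keys byte by byte, least significant byte first.
  byte[] sum = new byte[len + 1];
  int carry = 0;
  for (int i = len - 1; i >= 0; i--) {
    int s = (i < start.length) ? (start[i] & 0xff) : 0;
    int e = (i < end.length) ? (end[i] & 0xff) : 0;
    int v = s + e + carry;
    sum[i + 1] = (byte) v;
    carry = v >>> 8;
  }
  sum[0] = (byte) carry;
  // Divide the sum by two, most significant byte first.
  byte[] mid = new byte[len];
  int rem = sum[0] & 0xff;
  for (int i = 0; i < len; i++) {
    int v = (rem << 8) | (sum[i + 1] & 0xff);
    mid[i] = (byte) (v >>> 1);
    rem = v & 1;
  }
  return mid;
}

halfsplit would then be built from midpoint() of the two region boundary keys rather than from Integer.parseInt.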


Thanks,
Jonathan




