I agree - you want a 1:1 mapping of tasks to regions. If you only
have 22 regions, then I would wonder whether you even need HBase to
host that data.
If the rows in a region are REALLY small, and the amount of data
you'll have is bounded pretty low, then maybe you should consider
changing the configuration for region size to be something less than
256MB. Right now this is something that's configured at the server
level, but ideally it'll eventually be configurable on a table-by-
table basis.
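For what it's worth, the server-level setting in question is
hbase.hregion.max.filesize (mentioned later in this thread). A sketch
of what the hbase-site.xml entry might look like, with 64MB chosen
purely as an illustrative value below the 256MB default:

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 67108864 bytes = 64MB; illustrative, not a recommendation -->
  <value>67108864</value>
  <description>Maximum store file size before a region is split.</description>
</property>
```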
-Bryan
On May 8, 2008, at 1:43 PM, Andrew Purtell wrote:
I have also been considering this issue but from the opposite direction
-- forcing splits of the table. From the perspective of I/O and loading
optimization, doesn't it make the most sense to have a 1:1 mapping of
regions to tasks?
I think this issue will come up now and again if a user has tables that
hold a large number of items, yet those items have small keys and
column data.
Of course there are some problems with this, first and foremost that
too many regions for the carrying capacity of the cluster will take
down all of the region servers via OOME in a cascading spiral of
death. Then there is the issue of making sure the key space of a
forced split is not so small as to underutilize the available mapfile
storage. Then there is the issue of key distributions being dependent
on the particular dataset and schema, so tweaking the global setting
hbase.hregion.max.filesize might not be a good idea.
Just some random thoughts on the topic,
- Andrew Purtell
--- "Jonathan M. Kupferman" <[EMAIL PROTECTED]> wrote:
Hi Everyone,
I am currently attempting to run a MapReduce job where the input
comes from HBase. The input table has 22 regions, and thus creates 22
map tasks. This creates an issue, however, since so few map tasks
result in a poor distribution of labor on a cluster of 10+ machines,
especially since the amount of work required varies widely from
region to region.
I would like to increase the number of map tasks at least 2-fold; the
relevant code seems to be in TableInputFormat.
//Original code
Text[] startKeys = m_table.getStartKeys();
if (startKeys == null || startKeys.length == 0) {
    throw new IOException("Expecting at least one region");
}
InputSplit[] splits = new InputSplit[startKeys.length];
for (int i = 0; i < startKeys.length; i++) {
    splits[i] = new TableSplit(m_tableName, startKeys[i],
        ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text());
}
//end-original
//Modified code
Text[] startKeys = m_table.getStartKeys();
if (startKeys == null || startKeys.length == 0) {
    throw new IOException("Expecting at least one region");
}
List<InputSplit> splits = new ArrayList<InputSplit>();
for (int i = 0; i < startKeys.length; i++) {
    // end key of this region (empty Text for the last region)
    Text endKey = ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text();
    if (endKey.getLength() == 0) {
        // the last region has no end key to bisect, so leave it whole
        splits.add(new TableSplit(m_tableName, startKeys[i], endKey));
    } else {
        // midpoint of the region's key range -- assumes numeric row keys;
        // the first region's start key is empty, so treat it as 0
        int start = (startKeys[i].getLength() == 0)
            ? 0 : Integer.parseInt(startKeys[i].toString());
        int half = (start + Integer.parseInt(endKey.toString())) / 2;
        Text halfsplit = new Text(Integer.toString(half));
        splits.add(new TableSplit(m_tableName, startKeys[i], halfsplit));
        splits.add(new TableSplit(m_tableName, halfsplit, endKey));
    }
}
//end-modified
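The bisection idea above can also be sketched standalone, without the
HBase classes, to check the range arithmetic. RegionBisector, midpoint,
and bisect are hypothetical names, and decimal numeric row keys are
assumed, as in the modified code:

```java
import java.util.ArrayList;
import java.util.List;

public class RegionBisector {

    // Midpoint of a numeric key range [startKey, endKey).
    static String midpoint(String startKey, String endKey) {
        long start = startKey.isEmpty() ? 0 : Long.parseLong(startKey);
        long end = Long.parseLong(endKey);
        return Long.toString((start + end) / 2);
    }

    // Turn each region's [start, end) range into two sub-ranges.
    // lastEndKey stands in for the table's final end key.
    static List<String[]> bisect(String[] startKeys, String lastEndKey) {
        List<String[]> ranges = new ArrayList<String[]>();
        for (int i = 0; i < startKeys.length; i++) {
            String end = (i + 1 < startKeys.length) ? startKeys[i + 1] : lastEndKey;
            String half = midpoint(startKeys[i], end);
            ranges.add(new String[] { startKeys[i], half });
            ranges.add(new String[] { half, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Two regions, keys 0..100 and 100..200, become four sub-ranges.
        for (String[] r : bisect(new String[] { "0", "100" }, "200")) {
            System.out.println(r[0] + " - " + r[1]);
        }
    }
}
```

Each region yields exactly two sub-ranges here, so the job would see
twice as many map tasks without any actual region split on the servers.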
It seems like the required modifications would be something along the
lines of the code written above. Is this the correct/best way to go
about this?
Thanks,
Jonathan