Hi Everyone,
I am currently attempting to run a MapReduce job whose input
comes from HBase. The input table has 22 regions, and thus creates 22
map tasks. This creates an issue, since so few map tasks
result in a poor distribution of labor on a cluster of 10+ machines,
especially since the amount of work required varies widely
from region to region.
I would like to at least double the number of map tasks;
the relevant code seems to be in TableInputFormat.
//Original code
Text[] startKeys = m_table.getStartKeys();
if (startKeys == null || startKeys.length == 0) {
  throw new IOException("Expecting at least one region");
}
InputSplit[] splits = new InputSplit[startKeys.length];
for (int i = 0; i < startKeys.length; i++) {
  splits[i] = new TableSplit(m_tableName, startKeys[i],
      ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text());
}
//end-original
//Modified code
//Note: this assumes row keys are integers in string form. The first
//region's start key and the last region's end key are empty, so those
//regions are left as a single split. Using a List avoids the index
//arithmetic bugs of writing splits[i] and splits[i+1] in the same pass.
Text[] startKeys = m_table.getStartKeys();
if (startKeys == null || startKeys.length == 0) {
  throw new IOException("Expecting at least one region");
}
List<InputSplit> splits = new ArrayList<InputSplit>();
for (int i = 0; i < startKeys.length; i++) {
  Text endKey = ((i + 1) < startKeys.length) ? startKeys[i + 1] : new Text();
  if (startKeys[i].getLength() == 0 || endKey.getLength() == 0) {
    // no numeric boundary available; keep the region as one split
    splits.add(new TableSplit(m_tableName, startKeys[i], endKey));
  } else {
    int start = Integer.parseInt(startKeys[i].toString());
    int end = Integer.parseInt(endKey.toString());
    Text halfsplit = new Text(Integer.toString((start + end) / 2));
    splits.add(new TableSplit(m_tableName, startKeys[i], halfsplit));
    splits.add(new TableSplit(m_tableName, halfsplit, endKey));
  }
}
return splits.toArray(new InputSplit[splits.size()]);
//end-modified
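To sanity-check the midpoint logic outside of Hadoop, here is a small self-contained sketch (plain Java, no HBase classes; the class and method names are made up for illustration, and it assumes integer-valued row keys, the same assumption the Integer.parseInt call above makes):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Given region start keys (the last region is open-ended), return the
    // boundary keys after inserting a midpoint into each bounded region.
    static List<String> halveRegions(String[] startKeys) {
        List<String> boundaries = new ArrayList<String>();
        for (int i = 0; i < startKeys.length; i++) {
            boundaries.add(startKeys[i]);
            if (i + 1 < startKeys.length) {
                int start = Integer.parseInt(startKeys[i]);
                int end = Integer.parseInt(startKeys[i + 1]);
                // midpoint of the region [start, end)
                boundaries.add(Integer.toString((start + end) / 2));
            }
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // three regions starting at 0, 100, 200; the last is open-ended
        System.out.println(halveRegions(new String[] {"0", "100", "200"}));
        // prints [0, 50, 100, 150, 200]
    }
}
```

Note that real HBase row keys are arbitrary byte arrays, so this only works if your keys happen to be decimal strings.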
It seems like the required modifications would be something along the
lines of the code written above. Is this the correct/best way to go
about this?
Thanks,
Jonathan