Hello,
I have subclassed TableInputFormat and TableMapper. My job needs to read
from two tables (one row from each) in its map method. The reduce
method needs to write out to a table. For the reads and the writes I am
using plain Get and Put calls respectively, with autoflush set to true.
One problem I see is that the number of map tasks that I get with HBase
is limited to the number of regions in the table. This seems to make the
job slower than it would be if I had many more mappers. Could I improve
the situation by overriding getSplits so that I could have many more
mappers?
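
To make the getSplits idea concrete, here is a rough sketch of the part I think I would need: cutting one region's [startKey, endKey) range into several smaller ranges, so that an overridden getSplits could emit one split per sub-range instead of one per region. It is plain Java and does not touch the HBase API. The class and method names (KeyRangeSplitter, boundaries) are made up for illustration, and it assumes both keys are non-empty, which is not true for the first and last region of a table.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: computes boundary keys that cut
// the row-key range [start, end) into up to n roughly equal sub-ranges.
final class KeyRangeSplitter {
    private KeyRangeSplitter() {}

    // Returns start, the interior boundary keys, then end, in sorted order.
    // Adjacent pairs in the list are the sub-ranges. Assumes both keys are
    // non-empty and that start sorts before end.
    static List<byte[]> boundaries(byte[] start, byte[] end, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // Right-pad the shorter key with zero bytes so both keys can be
        // read as unsigned big-endian integers of the same width.
        int width = Math.max(start.length, end.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger span = hi.subtract(lo);
        BigInteger parts = BigInteger.valueOf(n);

        List<byte[]> keys = new ArrayList<>();
        keys.add(start);
        BigInteger prev = lo;
        for (int i = 1; i < n; i++) {
            BigInteger k = lo.add(span.multiply(BigInteger.valueOf(i)).divide(parts));
            // Skip duplicates: if the range is narrower than n, fewer than
            // n sub-ranges come back rather than empty ones.
            if (k.compareTo(prev) > 0) {
                keys.add(toFixedWidth(k, width));
                prev = k;
            }
        }
        keys.add(end);
        return keys;
    }

    // Converts a non-negative value back to exactly `width` bytes,
    // dropping BigInteger's sign byte or left-padding with zeros.
    private static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

With something like this, each adjacent pair of boundary keys would become its own split (keeping the original region's location), so a region cut four ways would feed four mappers. Whether that actually helps, or just moves the bottleneck to the region server, is the part I am unsure about.
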
I saw the following documented in TableMapReduceUtil: "Ensures that the
given number of reduce tasks for the given job configuration does not
exceed the number of regions for the given table." Is there some reason
one would want to ensure that the number of tasks doesn't exceed the
number of regions? It seems to me that having each region serve only a
single task would leave HBase underloaded. Thoughts?
-geoff