TableInputFormat creates one split (and therefore one mapper task) per region. 
When a table has many small regions, the per-task overhead of the MapReduce 
framework dominates the job's running time. A couple of existing work items 
could address this issue.


1. Reduce the number of small regions. 
https://issues.apache.org/jira/browse/HBASE-420

2. Improve the MapReduce framework's handling of small jobs. 
https://issues.apache.org/jira/browse/MAPREDUCE-1220

Another, quicker way to solve this is to improve TableInputFormat so that it 
can pack a configurable number of regions from a given region server into one 
mapper task. I tested this approach and saw a 40% improvement in map job 
latency.
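
To make the idea concrete, here is a rough sketch of the grouping step only. 
It is not the patch I tested and it does not touch the HBase API: RegionPacker, 
Region, pack and regionsPerSplit are all made-up names for illustration. It 
groups regions by the server hosting them and then cuts each server's list 
into chunks of at most regionsPerSplit regions, so that a mapper never spans 
more than one region server.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical names, no HBase classes involved.
class RegionPacker {

    /** Hypothetical stand-in for a region and the server that hosts it. */
    static final class Region {
        final String name;
        final String server;

        Region(String name, String server) {
            this.name = name;
            this.server = server;
        }
    }

    /**
     * Groups regions by hosting server, then cuts each server's list into
     * chunks of at most regionsPerSplit. Each chunk would become one split,
     * i.e. one mapper task.
     */
    static List<List<Region>> pack(List<Region> regions, int regionsPerSplit) {
        if (regionsPerSplit < 1) {
            throw new IllegalArgumentException("regionsPerSplit must be >= 1");
        }
        // LinkedHashMap keeps servers in first-seen order.
        Map<String, List<Region>> byServer = new LinkedHashMap<>();
        for (Region r : regions) {
            byServer.computeIfAbsent(r.server, k -> new ArrayList<>()).add(r);
        }
        List<List<Region>> splits = new ArrayList<>();
        for (List<Region> hosted : byServer.values()) {
            for (int i = 0; i < hosted.size(); i += regionsPerSplit) {
                int end = Math.min(i + regionsPerSplit, hosted.size());
                splits.add(new ArrayList<>(hosted.subList(i, end)));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            regions.add(new Region("region-" + i, "rs1"));
        }
        regions.add(new Region("region-5", "rs2"));
        regions.add(new Region("region-6", "rs2"));

        List<List<Region>> splits = pack(regions, 2);
        // rs1 yields chunks of 2, 2 and 1 regions; rs2 yields one chunk of 2.
        System.out.println(regions.size() + " regions -> " + splits.size() + " splits");
    }
}
```

With regionsPerSplit set to 1 this reduces to the current one-split-per-region 
behavior, so the setting could default to 1 and be raised only for tables with 
many small regions.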

Any feedback?

Ming
