Improve TableInputFormat to allow application to configure the number of mappers
--------------------------------------------------------------------------------
Key: HBASE-4063
URL: https://issues.apache.org/jira/browse/HBASE-4063
Project: HBase
Issue Type: Improvement
Components: mapreduce
Reporter: Ming Ma
Assignee: Ming Ma
TableInputFormat creates one split/mapper task per region. When a table has
lots of small regions, the per-task overhead of the map reduce framework
becomes significant. There are some related work items that could address
this issue.
1. Reduce the number of small regions.
https://issues.apache.org/jira/browse/HBASE-420
2. Improvement in map reduce framework to handle small jobs.
https://issues.apache.org/jira/browse/MAPREDUCE-1220
Another quick way to solve this is to improve TableInputFormat so that it
can pack a configurable number of regions from a given region server into one
mapper task. I tested this approach and achieved a 40% improvement in
map job latency.
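As a rough illustration of the packing idea, the sketch below groups regions
by hosting server and cuts each server's list into chunks of a configurable
size, so each chunk could back one mapper task. It is not the patch itself:
RegionPacker, Region, and regionsPerMapper are hypothetical names, and a real
implementation would work with TableInputFormat's split objects instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegionPacker {

    /** Hypothetical stand-in for a region: its name and the server hosting it. */
    public static final class Region {
        public final String name;
        public final String server;

        public Region(String name, String server) {
            this.name = name;
            this.server = server;
        }
    }

    /**
     * Groups regions by hosting server, then cuts each server's list into
     * chunks of at most regionsPerMapper regions. Each chunk is one candidate
     * split, so a mapper only ever reads regions from a single server.
     */
    public static List<List<Region>> pack(List<Region> regions, int regionsPerMapper) {
        if (regionsPerMapper < 1) {
            throw new IllegalArgumentException("regionsPerMapper must be >= 1");
        }
        // LinkedHashMap keeps servers in first-seen order for stable output.
        Map<String, List<Region>> byServer = new LinkedHashMap<>();
        for (Region r : regions) {
            byServer.computeIfAbsent(r.server, k -> new ArrayList<>()).add(r);
        }
        List<List<Region>> splits = new ArrayList<>();
        for (List<Region> hosted : byServer.values()) {
            for (int i = 0; i < hosted.size(); i += regionsPerMapper) {
                int end = Math.min(i + regionsPerMapper, hosted.size());
                splits.add(new ArrayList<>(hosted.subList(i, end)));
            }
        }
        return splits;
    }
}
```

With regionsPerMapper set to 1 this degenerates to the existing
one-split-per-region behavior.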
In addition, Ophir Cohen suggested support for multiple mappers per region as
below.
On Thu, Jun 30, 2011 at 8:38 AM, Ophir Cohen <[email protected]> wrote:
> Actually I thought of opposite version:
> If I have a spare map slots why not configure it to run more than one mapper
> on region?
> The question then is how to 'skip' the mappers to the needed places inside
> the regions.
Well, the current splitter passed mappers Scans where the start/end
rows are the region boundaries (at the time at which the splitter
ran).
To do your case, in the splitter, you'd just give out multiple splits
per region. To cut up the region key-space, you might use the
Bytes.split code. It does coarse BigNumber math dividing the key
space. See here:
http://hbase.apache.org/xref/org/apache/hadoop/hbase/util/Bytes.html#1034
St.Ack
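The key-space division described in the reply above can be sketched in plain
Java as follows. This is a stand-in for the idea, not the Bytes.split code
itself: KeyRangeSplitter and boundaries are hypothetical names, and the sketch
assumes both region boundaries are non-empty (the first and last regions of a
table, whose boundary is the empty key, would need separate handling).

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyRangeSplitter {

    /**
     * Returns pieces + 1 boundary keys from start to end. Consecutive pairs
     * form the start/stop rows of the per-mapper scans. Interior points are
     * evenly spaced when the keys are read as unsigned big-endian numbers.
     */
    public static List<byte[]> boundaries(byte[] start, byte[] end, int pieces) {
        if (pieces < 1) {
            throw new IllegalArgumentException("pieces must be >= 1");
        }
        // Right-pad the shorter key with zero bytes so both have equal width.
        int width = Math.max(start.length, end.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger span = hi.subtract(lo);
        List<byte[]> out = new ArrayList<>();
        out.add(start); // keep the original region boundary unchanged
        for (int i = 1; i < pieces; i++) {
            BigInteger offset = span.multiply(BigInteger.valueOf(i))
                                    .divide(BigInteger.valueOf(pieces));
            out.add(toFixedWidth(lo.add(offset), width));
        }
        out.add(end);
        return out;
    }

    /** Converts a non-negative value to exactly width bytes, big-endian. */
    private static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may have a leading sign byte or be short
        byte[] fixed = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, fixed, width - copy, copy);
        return fixed;
    }
}
```

Even spacing in key space does not imply even spacing in row count, so the
resulting mappers may still see uneven amounts of data. If the range is
narrower than the number of pieces, adjacent boundaries can coincide and the
caller would need to drop the empty ranges.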
To support the scenarios of:
a) One mapper for multiple regions.
b) Multiple mappers for one region.
We can modify TableInputFormat to allow an application to configure the number
of mappers. TableInputFormat will then calculate internally how to assign a
key range to each mapper.
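One possible shape for that internal calculation is sketched below, purely as
an assumption about how a requested mapper count might map onto the two
scenarios. MapperPlan and forTarget are hypothetical names, and the sketch
ignores which region server hosts each region.

```java
public class MapperPlan {
    /** How many regions each mapper reads (scenario a when > 1). */
    public final int regionsPerMapper;
    /** How many mappers share each region (scenario b when > 1). */
    public final int mappersPerRegion;

    private MapperPlan(int regionsPerMapper, int mappersPerRegion) {
        this.regionsPerMapper = regionsPerMapper;
        this.mappersPerRegion = mappersPerRegion;
    }

    /**
     * Picks a packing or splitting factor that approximates the requested
     * mapper count. The resulting number of mappers can differ from the
     * request because both factors are whole numbers.
     */
    public static MapperPlan forTarget(int regionCount, int desiredMappers) {
        if (regionCount < 1 || desiredMappers < 1) {
            throw new IllegalArgumentException("counts must be >= 1");
        }
        if (desiredMappers <= regionCount) {
            // Fewer mappers than regions: pack regions, rounding up so that
            // no more than desiredMappers tasks are created.
            int perMapper = (regionCount + desiredMappers - 1) / desiredMappers;
            return new MapperPlan(perMapper, 1);
        }
        // More mappers than regions: cut every region into equal pieces,
        // rounding down so the request is not exceeded.
        return new MapperPlan(1, desiredMappers / regionCount);
    }
}
```

For example, 100 regions with 10 requested mappers would pack 10 regions per
mapper, while 4 regions with 10 requested mappers would run 2 mappers per
region (8 in total).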