Weichen Ye created HBASE-12590:
----------------------------------
Summary: A solution for data skew in HBase-Mapreduce Job
Key: HBASE-12590
URL: https://issues.apache.org/jira/browse/HBASE-12590
Project: HBase
Issue Type: Improvement
Components: mapreduce
Affects Versions: 2.0.0
Reporter: Weichen Ye
1, Motivation
In production environments, data skew is very common. An HBase table often
contains many small regions and a few large ones. Small regions waste
computing resources: a job that scans a table with 3000 small regions needs
3000 mappers. Large regions stall the job: if one region in a 100-region table
is far larger than the other 99, a job that takes the table as input finishes
99 mappers very quickly and then has to wait a long time for the last one.
2, Configuration
Add two new configuration properties.
hbase.mapreduce.split.autobalance = true enables “auto balance” in
HBase-MapReduce jobs. The default value is false.
hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) sets the target
size of MapReduce splits.
If a region is larger than the target size, cut the region into two splits.
If the total size of several small contiguous regions is less than the target
size, combine these regions into one split.
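
The rule above can be illustrated with a minimal sketch. This is not the
attached patch: the class AutoBalanceSketch, the method planSplits and the
string labels are hypothetical names chosen for illustration, region sizes are
passed in directly instead of being read from the cluster, and a large region
is simply labelled as two halves rather than being cut at a real row key. The
sketch also assumes that a group of small regions is closed as soon as adding
the next region would reach the target size.

```java
import java.util.ArrayList;
import java.util.List;

public class AutoBalanceSketch {

    // Hypothetical planner: names[i] is a region and sizes[i] its size in bytes,
    // listed in key order. Returns one description string per planned split.
    static List<String> planSplits(String[] names, long[] sizes, long targetSize) {
        List<String> splits = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // small contiguous regions being combined
        long pendingSize = 0;

        for (int i = 0; i < names.length; i++) {
            if (sizes[i] > targetSize) {
                // Large region: close any open group, then cut the region in two.
                flush(pending, splits);
                pendingSize = 0;
                splits.add(names[i] + "[first half]");
                splits.add(names[i] + "[second half]");
            } else {
                // Small region: close the open group first if adding this region
                // would reach the target size (assumption of this sketch).
                if (!pending.isEmpty() && pendingSize + sizes[i] >= targetSize) {
                    flush(pending, splits);
                    pendingSize = 0;
                }
                pending.add(names[i]);
                pendingSize += sizes[i];
            }
        }
        flush(pending, splits);
        return splits;
    }

    // Emit the open group of small regions as a single combined split.
    private static void flush(List<String> pending, List<String> splits) {
        if (!pending.isEmpty()) {
            splits.add(String.join("+", pending));
            pending.clear();
        }
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long target = 1024L * mb; // 1073741824, the proposed default target size
        String[] names = {"r1", "r2", "r3", "r4", "r5"};
        long[] sizes = {100 * mb, 200 * mb, 3000 * mb, 600 * mb, 500 * mb};
        // prints [r1+r2, r3[first half], r3[second half], r4, r5]
        System.out.println(planSplits(names, sizes, target));
    }
}
```

In a real job the target would presumably come from
hbase.mapreduce.split.targetsize, and the planner would only run when
hbase.mapreduce.split.autobalance is true; with the flag off, each region
would map to one split as before.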
Example:
In attachment