[
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Hsieh updated HBASE-12590:
-----------------------------------
Summary: im (was: kim)
> im
> --
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Weichen Ye
> Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf,
> HBase-12590-v1.patch
>
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table
> often contains many small regions and a few large regions. Small
> regions waste computing resources: a job that scans a table with 3000
> small regions needs 3000 mappers. Large regions hold up the job: if one
> region in a 100-region table is far larger than the other 99, a job that
> takes the table as input will finish 99 mappers very quickly and then
> wait a long time for the last mapper.
> 2, Configuration
> Add two new configuration properties.
> hbase.mapreduce.split.autobalance = true enables “auto balance” in
> HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) sets the
> target size of MapReduce splits.
> If a region is larger than the target size, cut the region into two
> splits. If the combined size of several small contiguous regions is less
> than the target size, combine these regions into one split.
> Example:
> See the attachment.
> Reviews are welcome on Review Board:
> https://reviews.apache.org/r/28494/diff/#
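
The split rule described under "2, Configuration" can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the code
in HBase-12590-v1.patch: the class name SplitBalancer, the Split type, and
the balance method are hypothetical, regions are given as plain names and
byte sizes, and an oversized region is cut by halving its byte size because
the description does not say how the cut point is chosen.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the "auto balance" rule; not taken from the patch.
final class SplitBalancer {

    /** One input split: the region parts it covers and their total size in bytes. */
    static final class Split {
        final List<String> parts;
        final long sizeBytes;

        Split(List<String> parts, long sizeBytes) {
            this.parts = parts;
            this.sizeBytes = sizeBytes;
        }
    }

    /**
     * Walks the regions in key order. A region larger than targetSize becomes
     * two splits; consecutive smaller regions are merged into one split for as
     * long as their combined size stays below targetSize.
     */
    static List<Split> balance(String[] names, long[] sizes, long targetSize) {
        List<Split> splits = new ArrayList<>();
        List<String> pending = new ArrayList<>();
        long pendingSize = 0;

        for (int i = 0; i < names.length; i++) {
            if (sizes[i] > targetSize) {
                // Close the group collected so far, then cut the large region in two.
                if (!pending.isEmpty()) {
                    splits.add(new Split(new ArrayList<>(pending), pendingSize));
                    pending.clear();
                    pendingSize = 0;
                }
                long firstHalf = sizes[i] / 2; // assumption: cut by size, not by row key
                splits.add(new Split(Collections.singletonList(names[i] + "[1/2]"), firstHalf));
                splits.add(new Split(Collections.singletonList(names[i] + "[2/2]"),
                        sizes[i] - firstHalf));
            } else {
                // Start a new group if adding this region would reach the target size.
                if (!pending.isEmpty() && pendingSize + sizes[i] >= targetSize) {
                    splits.add(new Split(new ArrayList<>(pending), pendingSize));
                    pending.clear();
                    pendingSize = 0;
                }
                pending.add(names[i]);
                pendingSize += sizes[i];
            }
        }
        // Emit whatever small regions are still waiting in the last group.
        if (!pending.isEmpty()) {
            splits.add(new Split(new ArrayList<>(pending), pendingSize));
        }
        return splits;
    }
}
```

For example, with a target size of 100 and regions a=30, b=40, c=50,
big=250, d=10 (in key order), this sketch yields five splits: [a, b], [c],
big[1/2], big[2/2], and [d]. Without balancing, the same table would produce
five mappers too, but one of them would process 250 units on its own.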
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)