[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Ye updated HBASE-12590:
-------------------------------
    Description: 
1, Motivation
In a production environment, data skew is very common. An HBase table often 
contains many small regions and a few large ones. Small regions waste 
computing resources: a job that scans a table with 3000 small regions needs 
3000 mappers. Large regions block the job: if one region in a 100-region 
table is far larger than the other 99, then when we run a job with that table 
as input, 99 mappers finish very quickly and we wait a long time for the 
last one.

2, Configuration
Add two new configuration properties. 
hbase.mapreduce.split.autobalance = true enables "auto balance" in 
HBase-MapReduce jobs. The default value is false. 
hbase.mapreduce.split.targetsize = 1073741824 (default 1 GB) is the target 
size of the MapReduce splits. 
If a region is larger than the target size, cut it into two splits. If the 
total size of several contiguous small regions is less than the target size, 
combine those regions into one split.
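The split/combine rule above can be sketched as follows. This is a minimal
illustration with made-up region sizes, not the patch's actual implementation;
the class and method names (AutoBalanceSketch, balance) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the auto-balance rule: regions larger than targetSize are cut
// in two, and runs of contiguous small regions are merged as long as their
// combined size stays at or under targetSize.
public class AutoBalanceSketch {
    static List<Long> balance(long[] regionSizes, long targetSize) {
        List<Long> splitSizes = new ArrayList<>();
        long pending = 0; // accumulated size of contiguous small regions
        for (long size : regionSizes) {
            if (size > targetSize) {
                // flush any pending combined split, then cut the large region in two
                if (pending > 0) { splitSizes.add(pending); pending = 0; }
                splitSizes.add(size / 2);
                splitSizes.add(size - size / 2);
            } else if (pending + size > targetSize) {
                // adding this region would exceed the target: emit the combined split
                splitSizes.add(pending);
                pending = size;
            } else {
                pending += size;
            }
        }
        if (pending > 0) splitSizes.add(pending);
        return splitSizes;
    }

    public static void main(String[] args) {
        long gb = 1073741824L;
        // one 3 GB region followed by four 0.25 GB regions, target 1 GB
        long[] sizes = {3 * gb, gb / 4, gb / 4, gb / 4, gb / 4};
        // prints [1610612736, 1610612736, 1073741824]:
        // the 3 GB region becomes two 1.5 GB splits, the four small
        // regions combine into one 1 GB split
        System.out.println(balance(sizes, gb));
    }
}
```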

Example:
In attachment

Review is welcome on Review Board:
https://reviews.apache.org/r/28494/diff/#



  was:
1, Motivation
In production environment, data skew is a very common case. A HBase table 
always contains a lot of small regions and several large regions. Small regions 
waste a lot of computing resources. If we use a job to scan a table with 3000 
small regions, we need a job with 3000 mappers. Large regions always block the 
job. If in a 100-region table, one region is far larger then the other 99 
regions. When we run a job with the table as input, 99 mappers will be 
completed very quickly, and we need to wait for the last mapper for a long time.

2, Configuration
Add two new configuration. 
hbase.mapreduce.split.autobalance = true means enabling the “auto balance” in 
HBase-MapReduce jobs. The default value is false. 
hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size of 
mapreduce splits. 
If a region size is large than the target size, cut the region into two 
split.If the sum of several small continuous region size less than the target 
size, combine these regions into one split.

Example:
In attachment




> A solution for data skew in HBase-Mapreduce Job 
> ------------------------------------------------
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, 
> HBase-12590-v1.patch
>
>
> 1, Motivation
> In a production environment, data skew is very common. An HBase table often 
> contains many small regions and a few large ones. Small regions waste 
> computing resources: a job that scans a table with 3000 small regions needs 
> 3000 mappers. Large regions block the job: if one region in a 100-region 
> table is far larger than the other 99, then when we run a job with that 
> table as input, 99 mappers finish very quickly and we wait a long time for 
> the last one.
> 2, Configuration
> Add two new configuration properties. 
> hbase.mapreduce.split.autobalance = true enables "auto balance" in 
> HBase-MapReduce jobs. The default value is false. 
> hbase.mapreduce.split.targetsize = 1073741824 (default 1 GB) is the target 
> size of the MapReduce splits. 
> If a region is larger than the target size, cut it into two splits. If the 
> total size of several contiguous small regions is less than the target 
> size, combine those regions into one split.
> Example:
> In attachment
> Review is welcome on Review Board:
> https://reviews.apache.org/r/28494/diff/#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
