[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230866#comment-14230866
 ] 

Weichen Ye commented on HBASE-12590:
------------------------------------

[~jmhsieh] Thank you for your review and your advice! 

1) The word "split" may be confusing or misleading here. I'll change the code 
and docs accordingly.

2) This is the difficult part of the patch. It is hard (at least for me) to 
split a large region into several small "MR input splits" of a target size, 
because we only have the "start rowkey", the "end rowkey" and the region 
size. So my approach is simply to find a "mid rowkey" between the "start 
rowkey" and the "end rowkey". Do you have any ideas about this? For instance, 
if we split a 5GB region into five 1GB MR input splits, how do we find the 
split points (rowkeys) that make each of these MR input splits equal to 1GB?
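
To make the "mid rowkey" idea concrete, here is a minimal sketch in plain 
Java. It is not the code from the patch: the class name MidRowKey and the 
method midpoint are hypothetical, and the sketch assumes both keys are 
non-empty and that start sorts before end (the empty keys of the first and 
last region would need special handling). It halves the key space, not the 
data, so the two halves have equal size only if rows are evenly distributed 
over the key range.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of HBase or of the patch. */
final class MidRowKey {

    /**
     * Returns a key roughly halfway between start and end, treating both keys
     * as unsigned big-endian numbers that are zero-padded on the right.
     */
    static byte[] midpoint(byte[] start, byte[] end) {
        // One extra byte so that even adjacent keys get a key in between.
        int len = Math.max(start.length, end.length) + 1;

        // Arrays.copyOf pads with trailing zero bytes; signum 1 means unsigned.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, len));
        BigInteger mid = lo.add(hi).shiftRight(1);

        // toByteArray() may carry a leading sign byte or drop leading zeros,
        // so right-align its low-order bytes into a buffer of exactly len bytes.
        byte[] raw = mid.toByteArray();
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }
}
```

For example, midpoint(new byte[]{0x10}, new byte[]{0x20}) yields the two-byte 
key {0x18, 0x00}. Finding split points that produce equally sized splits 
would need information about the data distribution that the start key, end 
key and region size alone do not provide.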

3) You gave me a great idea! I totally agree with setting a ratio rather than 
a constant size in the configuration. This week I'll make a new patch that 
works this way.


> A solution for data skew in HBase-Mapreduce Job
> -----------------------------------------------
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, 
> HBase-12590-v1.patch
>
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table 
> often contains many small regions and a few large regions. Small 
> regions waste a lot of computing resources: if we use a job to scan a table 
> with 3000 small regions, we need a job with 3000 mappers. Large regions 
> hold up the job: if, in a 100-region table, one region is far larger than 
> the other 99 regions, then when we run a job with the table as input, 99 
> mappers complete very quickly and we wait a long time for the last 
> mapper.
> 2, Configuration
> Add two new configuration properties. 
> hbase.mapreduce.split.autobalance = true enables “auto balance” in 
> HBase-MapReduce jobs. The default value is false. 
> hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) is the target 
> size of MapReduce splits. 
> If a region is larger than the target size, cut the region into two 
> splits. If the combined size of several small contiguous regions is less 
> than the target size, combine these regions into one split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#
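
The cut-and-combine rule from the Configuration section above can be 
sketched in plain Java as follows. This is an illustration under stated 
assumptions, not the code from HBase-12590-v1.patch: the class 
AutoBalanceSketch and the method plan are hypothetical, regions are 
represented only by their sizes in key order, a run of small regions is 
combined as long as its total does not exceed the target, and a region cut 
in two is assumed to yield two halves of equal size.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the balancing rule; not the patch itself. */
final class AutoBalanceSketch {

    /**
     * Takes region sizes in key order and returns the estimated size in
     * bytes of each planned MapReduce input split.
     */
    static List<Long> plan(long[] regionSizes, long targetSize) {
        List<Long> splits = new ArrayList<>();
        long pending = 0;           // total size of the current run of small regions
        boolean hasPending = false; // whether such a run is open

        for (long size : regionSizes) {
            if (size > targetSize) {
                // Close the open run first so only contiguous regions are combined.
                if (hasPending) {
                    splits.add(pending);
                    pending = 0;
                    hasPending = false;
                }
                // Cut the large region in two (assumed equal halves).
                splits.add(size / 2);
                splits.add(size - size / 2);
            } else if (hasPending && pending + size > targetSize) {
                // Adding this region would exceed the target: start a new run.
                splits.add(pending);
                pending = size;
            } else {
                pending += size;
                hasPending = true;
            }
        }
        if (hasPending) {
            splits.add(pending);
        }
        return splits;
    }
}
```

With a target size of 1000 and region sizes 100, 200, 300, 5000, 400, 700, 
this plan combines the first three regions into one split of 600, cuts the 
5000 region into two splits of 2500, and leaves 400 and 700 as separate 
splits because together they would exceed the target.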



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
