[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233278#comment-14233278
 ] 

Jonathan Hsieh commented on HBASE-12590:
----------------------------------------

{quote}
2) It is a difficult issue in this patch. It is hard (~for me) to split a large 
region into several small "MR input splits" with target size ( we have only 
"start rowkey", "end rowkey" and the Region size). So my point is just find a 
"mid rowkey" between "start rowkey" and "end rowkey". Do you have any ideas 
about this? For instance if we split a 5GB region into five 1GB MR input 
splits, how to find the split point(rowkey) to make the size of these MR input 
splits equal to 1GB?
{quote}

Internally, the split operation tries to read the cell closest to the midpoint 
of the hfiles and makes no assumptions about rowkey distribution [1,2,3]. 
These values, however, are not exposed for the MR input format to use. The v1 
and v2 patches here calculate a split point assuming an ASCII-centric, uniform 
distribution of rowkeys in the input split. You should at least note that in 
the docs. Since you are already generating the split point from the uniform 
distribution assumption, you can probably calculate additional split points 
relatively easily.
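
To illustrate the last point, here is a minimal sketch (not the code from the 
v1/v2 patches) of how N-1 split points could be interpolated between a start 
and end rowkey under the uniform distribution assumption. The class and method 
names (UniformSplitPoints, splitPoints, toFixedWidth) are hypothetical. It 
treats keys as raw unsigned big-endian bytes rather than ASCII, and it assumes 
a non-empty end key, so the last region of a table (empty end key) would need 
separate handling.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper, not part of HBase: interpolates split keys between
// two row keys, assuming row keys are uniformly distributed over the range.
final class UniformSplitPoints {

    /** Returns numSplits - 1 interior keys evenly spaced between start and end. */
    static byte[][] splitPoints(byte[] start, byte[] end, int numSplits) {
        if (numSplits < 2) {
            throw new IllegalArgumentException("numSplits must be >= 2");
        }
        // Right-pad both keys with zero bytes to a common width, plus one
        // extra byte of precision so adjacent keys can still be divided.
        int width = Math.max(start.length, end.length) + 1;
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start key must sort before end key");
        }
        BigInteger range = hi.subtract(lo);
        byte[][] points = new byte[numSplits - 1][];
        for (int i = 1; i < numSplits; i++) {
            // point_i = lo + range * i / numSplits
            BigInteger p = lo.add(
                range.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(numSplits)));
            points[i - 1] = toFixedWidth(p, width);
        }
        return points;
    }

    /** Converts a non-negative value to exactly width bytes, big-endian. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte, or be shorter
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

For the 5GB example above, calling splitPoints with start key "a", end key "f" 
and numSplits = 5 yields four keys whose first bytes are 'b', 'c', 'd' and 
'e', each followed by a 0x00 padding byte. The resulting splits are only equal 
in size if the rowkeys really are uniformly distributed, which is exactly the 
assumption the docs should call out.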

Thanks for posting on Review Board; I've added more comments there.

[1] 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6023
[2] 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.java#L67
[3] 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java#L670

> A solution for data skew in HBase-Mapreduce Job
> -----------------------------------------------
>
>                 Key: HBASE-12590
>                 URL: https://issues.apache.org/jira/browse/HBASE-12590
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Weichen Ye
>         Attachments: A Solution for Data Skew in HBase-MapReduce Job 
> (Version2).pdf, HBase-12590-v1.patch, HBase-12590-v2.patch
>
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table 
> often contains many small regions and several large regions. Small 
> regions waste a lot of computing resources: if we use a job to scan a table 
> with 3000 small regions, we need a job with 3000 mappers. Large regions 
> hold up the job: if, in a 100-region table, one region is far larger than 
> the other 99, then when we run a job with the table as input, 99 mappers 
> complete very quickly and we wait a long time for the last mapper.
> 2, Configuration
> Add two new configuration options. 
> hbase.mapreduce.split.autobalance = true enables “auto balance” in 
> HBase-MapReduce jobs. The default value is false. 
> hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) is the target 
> size of MapReduce splits. 
> If a region is larger than the target size, cut the region into two 
> splits. If the combined size of several small contiguous regions is less 
> than the target size, combine these regions into one split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
