[
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230024#comment-14230024
]
Jonathan Hsieh commented on HBASE-12590:
----------------------------------------
Nice description of the problem in the slide deck. I did a quick scan of the
docs and the code and had a few questions.
1) The world "split" is ambiguous. Need to make it clear in java doc that this
is only a "MR input split" and not an "hbase region split" operation that would
trigger a lot io.
2) Why do we only split by 2? Why not split further so that we have n mr input
splits that are 1gb (in your example) instead of a 2x 3gb, 2x 2.5gb and a 2x
1gb "artificial" mr input splits?
3) To make this easier for users, do you think it might may sense to use
something other than a constant size (which assumes the user knows the the
server side region size property)? can we look at all of the regions sizes (we
have the info already with the RegionSizeCalculator), and just add new MR
inputsplits for the regions that are proportionately too large? Maybe we have
the setting be a ratio (maybe 5x-10x) larger than the median median region
size? That way the job won't have to change if the server side setting changes.
> A solution for data skew in HBase-Mapreduce Job
> -----------------------------------------------
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Weichen Ye
> Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf,
> HBase-12590-v1.patch
>
>
> 1, Motivation
> In production environment, data skew is a very common case. A HBase table
> always contains a lot of small regions and several large regions. Small
> regions waste a lot of computing resources. If we use a job to scan a table
> with 3000 small regions, we need a job with 3000 mappers. Large regions
> always block the job. If in a 100-region table, one region is far larger then
> the other 99 regions. When we run a job with the table as input, 99 mappers
> will be completed very quickly, and we need to wait for the last mapper for a
> long time.
> 2, Configuration
> Add two new configuration.
> hbase.mapreduce.split.autobalance = true means enabling the “auto balance” in
> HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size
> of mapreduce splits.
> If a region size is large than the target size, cut the region into two
> split.If the sum of several small continuous region size less than the target
> size, combine these regions into one split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)