[ https://issues.apache.org/jira/browse/HBASE-22349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530220#comment-17530220 ]
Hudson commented on HBASE-22349:
--------------------------------
Results for branch branch-2.5
[build #106 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/106/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(x) {color:red}-1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/106/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/106/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/106/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/106/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> Stochastic Load Balancer skips balancing when node is replaced in cluster
> -------------------------------------------------------------------------
>
> Key: HBASE-22349
> URL: https://issues.apache.org/jira/browse/HBASE-22349
> Project: HBase
> Issue Type: Bug
> Components: Balancer
> Affects Versions: 3.0.0-alpha-1, 1.3.0, 1.4.4, 2.0.0
> Reporter: Suthan Phillips
> Assignee: David Manning
> Priority: Major
> Fix For: 2.5.0, 3.0.0-alpha-3
>
> Attachments: Hbase-22349.pdf
>
>
> HBASE-24139 allows the load balancer to run when one server has 0 regions and
> another server has more than 1 region. This is a special case of a more
> generic problem, where one server has far too few or far too many regions.
> The StochasticLoadBalancer defaults may decide the cluster is "balanced
> enough" according to {{hbase.master.balancer.stochastic.minCostNeedBalance}},
> even though one server may have a far higher or lower number of regions
> compared to the rest of the cluster.
> One specific example of this we have seen is when we use {{RegionMover}} to
> move regions back to a restarted RegionServer, if the
> {{StochasticLoadBalancer}} happens to be running. The load balancer sees a
> newly restarted RegionServer with 0 regions, and after HBASE-24139, it will
> balance regions to this server. Simultaneously, {{RegionMover}} moves back
> regions. The end result is that the newly restarted RegionServer has twice
> the load of any other server in the cluster. Future iterations of the load
> balancer do nothing, as the cluster cost does not exceed
> {{minCostNeedBalance}}.
> Another example: if the load balancer makes very slow progress on a
> cluster, it may not bring a newly restarted RegionServer up to the average
> cluster load in one iteration. After that first iteration, the balancer
> may not run again because the cluster cost does not exceed
> {{minCostNeedBalance}}.
> We can propose a solution where we reuse the {{slop}} concept in
> {{SimpleLoadBalancer}} and use this to extend the HBASE-24139 logic for
> deciding to run the balancer as long as there is a "sloppy" server in the
> cluster.
> +*Previous Description Notes Below, which are relevant, but as stated, were
> already fixed by HBASE-24139*+
> In an EMR cluster, whenever I replace one of the nodes, the regions never get
> rebalanced.
> The default minCostNeedBalance of 0.05 is too high.
> The region counts on the servers were: 21, 21, 20, 20, 20, 20, 21, 20, 20, 20
> = 203
> Once a node (region server) was replaced with a new node (terminated, then
> recreated by EMR), the region counts on the servers became: 23, 0, 23, 22, 22,
> 22, 22, 23, 23, 23 = 203
> In the hbase-master-logs, I can see the log output below, which indicates
> that the default minCostNeedBalance does not work well for these scenarios.
> ##
> 2019-04-29 09:31:37,027 WARN
> [ip-172-31-35-122.ec2.internal,16000,1556524892897_ChoreService_1]
> cleaner.CleanerChore: WALs outstanding under
> hdfs://ip-172-31-35-122.ec2.internal:8020/user/hbase/oldWALs
> 2019-04-29 09:31:42,920 INFO
> [ip-172-31-35-122.ec2.internal,16000,1556524892897_ChoreService_1]
> balancer.StochasticLoadBalancer: Skipping load balancing because balanced
> cluster; total cost is 52.041826194833405, sum multiplier is 1102.0 min cost
> which need balance is 0.05
> ##
> To mitigate this, I had to lower minCostNeedBalance from its default to a
> value such as 0.01f and restart the RegionServers and the HBase Master. After
> changing this value to 0.01f, I could see the regions getting rebalanced.
> This has led me to the following questions, which I would like the HBase
> experts to answer.
> 1) What are the factors that affect the value of total cost and sum
> multiplier? How could we determine the right minCostNeedBalance value for any
> cluster?
> 2) How did HBase arrive at the default value of 0.05f? Is it an optimal
> value? If yes, then what is the recommended way to mitigate this scenario?
> Attached: Steps to reproduce
>
> Note: The HBASE-17565 patch is already applied.
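
The figures quoted in the description above can be checked with a short
sketch. It is not HBase code: the function names are hypothetical, the cost
check assumes the balancer skips when total cost divided by sum multiplier is
below minCostNeedBalance (which is consistent with the quoted log line), and
the slop value of 0.2 is an assumed setting used only for illustration.

```python
import math


def needs_balance_by_cost(total_cost, sum_multiplier, min_cost_need_balance=0.05):
    """Assumed check: balance only if the normalized cost reaches the threshold."""
    if total_cost <= 0 or sum_multiplier <= 0:
        return False
    return total_cost / sum_multiplier >= min_cost_need_balance


def sloppy_servers(region_counts, slop=0.2):
    """Return indices of servers whose region count falls outside
    [floor(avg * (1 - slop)), ceil(avg * (1 + slop))]."""
    average = sum(region_counts) / len(region_counts)
    floor = math.floor(average * (1 - slop))
    ceiling = math.ceil(average * (1 + slop))
    return [i for i, n in enumerate(region_counts) if n < floor or n > ceiling]


# Values taken from the log line and region counts in the description.
total_cost, sum_multiplier = 52.041826194833405, 1102.0
after_replace = [23, 0, 23, 22, 22, 22, 22, 23, 23, 23]

print(round(total_cost / sum_multiplier, 4))                        # normalized cost
print(needs_balance_by_cost(total_cost, sum_multiplier, 0.05))      # default threshold
print(needs_balance_by_cost(total_cost, sum_multiplier, 0.01))      # lowered threshold
print(sloppy_servers(after_replace))                                # servers outside the slop band
```

Under these assumptions the normalized cost is about 0.047, just below the
0.05 default but above 0.01, which matches the reported behavior. A
slop-style check flags the empty server (index 1) regardless of the cost
threshold, which is the idea behind the proposed solution.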
--
This message was sent by Atlassian Jira
(v8.20.7#820007)