[
https://issues.apache.org/jira/browse/HBASE-22349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875277#comment-16875277
]
Xu Cang edited comment on HBASE-22349 at 6/28/19 11:06 PM:
-----------------------------------------------------------
This is a very good observation. One of my co-workers observed and debugged a
similar issue in our environment.
Obviously we don't want an RS to hold 0 regions while the LB still thinks the
cluster is 'balanced'.
Besides tweaking 'minCostNeedBalance', maybe we can introduce a rule that when
an RS holds 0 regions, it still triggers balancing regardless.
Or, we can adjust cost() for this class:
static class PrimaryRegionCountSkewCostFunction
to make this factor weigh more than the others?
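A minimal sketch of the "empty RS always triggers balancing" rule proposed above. This is not the StochasticLoadBalancer code: the class and method names are hypothetical, and the fallback comparison (total cost divided by sum multiplier against minCostNeedBalance) is an assumption based on the log message in the issue description. The extra condition that some other server holds at least 2 regions is also an assumption, added so that a cluster with fewer regions than servers does not rebalance forever.

```java
import java.util.Arrays;

// Hypothetical helper, not part of HBase.
final class EmptyServerRule {

    // True when one server holds no regions while another holds at least two,
    // i.e. moving a single region would reduce the skew.
    static boolean hasAvoidableEmptyServer(int[] regionCounts) {
        if (regionCounts.length == 0) {
            return false;
        }
        int min = Arrays.stream(regionCounts).min().getAsInt();
        int max = Arrays.stream(regionCounts).max().getAsInt();
        return min == 0 && max >= 2;
    }

    // Proposed decision: force balancing for an empty server, otherwise fall
    // back to the (assumed) normalized-cost threshold check.
    static boolean needsBalance(int[] regionCounts, double totalCost,
                                double sumMultiplier, float minCostNeedBalance) {
        if (hasAvoidableEmptyServer(regionCounts)) {
            return true;
        }
        if (totalCost <= 0 || sumMultiplier <= 0) {
            return false;
        }
        return totalCost / sumMultiplier >= minCostNeedBalance;
    }

    public static void main(String[] args) {
        // Region counts reported in the issue after the node was replaced.
        int[] afterReplacement = {23, 0, 23, 22, 22, 22, 22, 23, 23, 23};
        System.out.println(
            needsBalance(afterReplacement, 52.041826194833405, 1102.0, 0.05f));
    }
}
```

With the counts from the issue, the rule forces balancing even though the default threshold of 0.05 alone would not.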
was (Author: xucang):
This is a very good observation. One of my co-workers observed and debugged a
similar issue in our environment.
Obviously we don't want an RS to hold 0 regions while the LB still thinks the
cluster is 'balanced'.
Besides tweaking 'minCostNeedBalance', maybe we can introduce a rule that when
an RS holds 0 regions, it still triggers balancing regardless.
> Stochastic Load Balancer skips balancing when node is replaced in cluster
> -------------------------------------------------------------------------
>
> Key: HBASE-22349
> URL: https://issues.apache.org/jira/browse/HBASE-22349
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.4.4
> Reporter: Suthan Phillips
> Priority: Major
> Attachments: Hbase-22349.pdf
>
>
> In an EMR cluster, whenever I replace one of the nodes, the regions never get
> rebalanced.
> The default minCostNeedBalance of 0.05 is too high.
> The region counts on the servers were: 21, 21, 20, 20, 20, 20, 21, 20, 20, 20
> = 203
> Once a node (region server) was replaced with a new node (terminated and
> recreated by EMR), the region counts on the servers became: 23, 0, 23, 22, 22,
> 22, 22, 23, 23, 23 = 203
> From hbase-master-logs, I can see the log output below, which indicates that
> the default minCostNeedBalance is not suitable for these scenarios.
> ##
> 2019-04-29 09:31:37,027 WARN [ip-172-31-35-122.ec2.internal,16000,1556524892897_ChoreService_1] cleaner.CleanerChore: WALs outstanding under hdfs://ip-172-31-35-122.ec2.internal:8020/user/hbase/oldWALs
> 2019-04-29 09:31:42,920 INFO [ip-172-31-35-122.ec2.internal,16000,1556524892897_ChoreService_1] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 52.041826194833405, sum multiplier is 1102.0 min cost which need balance is 0.05
> ##
> To mitigate this, I had to lower minCostNeedBalance from its default to a
> value like 0.01f and restart the Region Servers and the HBase Master. After
> changing this value to 0.01f I could see the regions getting rebalanced.
> This has led me to the following questions, which I would like to have
> answered by the HBase experts.
> 1) What factors affect the values of total cost and sum
> multiplier? How can we determine the right minCostNeedBalance value for any
> cluster?
> 2) How did HBase arrive at the default value of 0.05f? Is it an optimal
> value? If yes, then what is the recommended way to mitigate this scenario?
> Attached: Steps to reproduce
>
> Note: HBase-17565 patch is already applied.
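
The numbers in the log line quoted above can be checked by hand. The sketch below assumes, as the wording of the log message suggests, that the balancer skips a run when total cost divided by sum multiplier is below minCostNeedBalance. The class and method names are hypothetical and this is not the HBase source.

```java
// Hypothetical helper that reproduces the arithmetic from the log line.
final class NormalizedCostCheck {

    // Assumed skip rule: balance only when totalCost / sumMultiplier
    // reaches the configured threshold.
    static boolean needsBalance(double totalCost, double sumMultiplier,
                                float minCostNeedBalance) {
        if (totalCost <= 0 || sumMultiplier <= 0) {
            return false;
        }
        return totalCost / sumMultiplier >= minCostNeedBalance;
    }

    public static void main(String[] args) {
        double totalCost = 52.041826194833405; // from the log line
        double sumMultiplier = 1102.0;         // from the log line

        // The normalized cost is roughly 0.047, just under the 0.05 default.
        System.out.println("normalized cost = " + totalCost / sumMultiplier);
        System.out.println("balance at 0.05f? "
            + needsBalance(totalCost, sumMultiplier, 0.05f));
        System.out.println("balance at 0.01f? "
            + needsBalance(totalCost, sumMultiplier, 0.01f));
    }
}
```

Under that assumption the normalized cost of about 0.047 falls just below the default of 0.05, which is consistent with the balancer skipping the run and with rebalancing starting once the threshold was lowered to 0.01f.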