[
https://issues.apache.org/jira/browse/HBASE-22349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Suthan Phillips updated HBASE-22349:
------------------------------------
Affects Version/s: 1.4.4
Attachment: Hbase-22349.pdf
Description:
In an EMR cluster, whenever I replace one of the nodes, the regions never get
rebalanced.
The default minCostNeedBalance of 0.05 is too high.
The region counts on the servers were: 21, 21, 20, 20, 20, 20, 21, 20, 20, 20 =
203
Once a node (region server) was replaced with a new node (terminated and
recreated by EMR), the region counts on the servers became: 23, 0, 23, 22, 22,
22, 22, 23, 23, 23 = 203
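To put a rough number on the imbalance in the two distributions above, the sketch below computes a simple skew measure: the sum of absolute deviations from the mean region count, divided by the worst case in which one server holds every region. This is only an illustrative measure with a hypothetical function name (`skew`); it is not claimed to be the cost function that StochasticLoadBalancer actually uses.

```python
def skew(region_counts):
    """Illustrative skew in [0, 1]: 0 is perfectly even, 1 is all regions on one server.

    Assumption: this is a simplified measure for discussion, not HBase's own formula.
    """
    total = sum(region_counts)
    n = len(region_counts)
    mean = total / n
    # Total distance of every server from the mean region count.
    deviation = sum(abs(count - mean) for count in region_counts)
    # Worst case: one server holds everything, the other n - 1 hold nothing.
    worst = (total - mean) + (n - 1) * mean
    return deviation / worst if worst > 0 else 0.0


before = [21, 21, 20, 20, 20, 20, 21, 20, 20, 20]  # counts before the node was replaced
after = [23, 0, 23, 22, 22, 22, 22, 23, 23, 23]    # counts after the node was replaced

print(f"before: {skew(before):.4f}")
print(f"after:  {skew(after):.4f}")
```

By this measure the skew after the replacement is roughly ten times the skew before it, even though the balancer still reports the cluster as balanced.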
From the HBase master logs, I can see the messages below, which indicate that
the default minCostNeedBalance does not work well for these scenarios.
##
2019-04-29 09:31:37,027 WARN
[ip-172-31-35-122.ec2.internal,16000,1556524892897_ChoreService_1]
cleaner.CleanerChore: WALs outstanding under
hdfs://ip-172-31-35-122.ec2.internal:8020/user/hbase/oldWALs
2019-04-29 09:31:42,920 INFO
[ip-172-31-35-122.ec2.internal,16000,1556524892897_ChoreService_1]
balancer.StochasticLoadBalancer: Skipping load balancing because balanced
cluster; total cost is 52.041826194833405, sum multiplier is 1102.0 min cost
which need balance is 0.05
##
To mitigate this, I had to lower minCostNeedBalance from its default to a value
such as 0.01f and restart the Region Servers and the HBase Master. After
changing the value to 0.01f, I could see the regions being rebalanced.
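The numbers in the log message above are consistent with a check of the form "total cost / sum multiplier < minCostNeedBalance". The sketch below reproduces that arithmetic. The form of the check is an assumption inferred from the log message, not a copy of the balancer source, and the function name `skips_balancing` is hypothetical.

```python
def skips_balancing(total_cost, sum_multiplier, min_cost_need_balance):
    """Return True if the balancer would treat the cluster as balanced.

    Assumption: the decision compares total_cost / sum_multiplier
    against the threshold, as the log message suggests.
    """
    if total_cost <= 0 or sum_multiplier <= 0:
        return True
    return (total_cost / sum_multiplier) < min_cost_need_balance


# Values taken from the StochasticLoadBalancer log message.
total_cost = 52.041826194833405
sum_multiplier = 1102.0

ratio = total_cost / sum_multiplier
print(f"normalized cost = {ratio:.4f}")

# Default threshold of 0.05 versus the lowered value of 0.01.
print("skip with 0.05:", skips_balancing(total_cost, sum_multiplier, 0.05))
print("skip with 0.01:", skips_balancing(total_cost, sum_multiplier, 0.01))
```

Under this assumption the normalized cost is about 0.047, just under the 0.05 default, which would explain why balancing was skipped and why lowering the threshold to 0.01 triggered a rebalance.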
This has led me to the following questions, which I would like the HBase
experts to answer.
1) What factors affect the values of total cost and sum multiplier?
How can we determine the right minCostNeedBalance value for a given cluster?
2) How did HBase arrive at the default value of 0.05f? Is it an optimal
value? If so, what is the recommended way to mitigate this scenario?
Attached: Steps to reproduce
Note: The HBASE-17565 patch is already applied.
Summary: Stochastic Load Balancer skips balancing when node is
replaced in cluster (was: eifjccgngfnjugrvnklblbflhjfehbbckhcktubbnvur)
> Stochastic Load Balancer skips balancing when node is replaced in cluster
> -------------------------------------------------------------------------
>
> Key: HBASE-22349
> URL: https://issues.apache.org/jira/browse/HBASE-22349
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.4.4
> Reporter: Suthan Phillips
> Priority: Major
> Attachments: Hbase-22349.pdf
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)