[ 
https://issues.apache.org/jira/browse/HBASE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852551#comment-15852551
 ] 

Guanghao Zhang commented on HBASE-17565:
----------------------------------------

Now that EPSILON = 0.000000000000001D, do we need to consider the aggregate 
effect of the cost multiplied by its multiplier? This only matters when there 
is a very large multiplier.
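
To put a rough number on that question, here is a minimal sketch (not code 
from the patch). It assumes a cost below EPSILON is treated as 0, so the 
weighted cost that gets dropped is at most EPSILON times the multiplier. The 
class and method names are hypothetical; 100000.0 is the 
RegionReplicaHostCostFunction multiplier from the debug log in the issue, and 
1e13 is an invented extreme value for comparison only.

```java
public class EpsilonEffect {
    // Value quoted in this comment.
    static final double EPSILON = 0.000000000000001D;

    /**
     * Upper bound on the weighted cost (cost * multiplier) that is dropped
     * when a cost below EPSILON is treated as 0.
     */
    public static double maxIgnoredWeightedCost(double multiplier) {
        return EPSILON * multiplier;
    }

    public static void main(String[] args) {
        // Multiplier taken from the debug log in the issue description.
        System.out.println(maxIgnoredWeightedCost(100000.0));
        // Hypothetical, unrealistically large multiplier.
        System.out.println(maxIgnoredWeightedCost(1e13));
    }
}
```

With the 100000.0 multiplier the dropped amount is on the order of 1e-10, so 
it would take a far larger multiplier before the aggregate effect becomes 
noticeable.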

bq. The large multiplier for read replica was obtained through trial and error 
when developing read replica feature.
As the javadoc of StochasticLoadBalancer says:
{code}
 * <p>Every cost function returns a number between 0 and 1 inclusive; where 0 is the lowest cost
 * best solution, and 1 is the highest possible cost and the worst solution.  The computed costs are
 * scaled by their respective multipliers:</p>
{code}
A bigger multiplier means that the respective cost function has a bigger 
weight. Why do we need such a big default multiplier for read replica? It means 
read replica has the biggest weight of all the cost functions. In our use case, 
we always configure the multipliers so that the sum over all cost functions is 
100. 
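
As a rough illustration of how dominant the replica weights are, the sketch 
below computes each multiplier's share of the total weight. The class and 
method names are hypothetical; the numbers (sum multiplier 111087.0, 
multipliers 100000.0 and 10000.0) come from the logs in the issue description 
quoted below.

```java
public class MultiplierShare {
    /** Fraction of the total weight carried by one multiplier. */
    public static double share(double multiplier, double sumMultiplier) {
        return multiplier / sumMultiplier;
    }

    public static void main(String[] args) {
        double sum = 111087.0;         // "sum multiplier" from the INFO log
        double replicaHost = 100000.0; // RegionReplicaHostCostFunction
        double replicaRack = 10000.0;  // RegionReplicaRackCostFunction

        System.out.printf("replica host share: %.4f%n", share(replicaHost, sum));
        System.out.printf("replica rack share: %.4f%n", share(replicaRack, sum));
        System.out.printf("both together:      %.4f%n",
            share(replicaHost + replicaRack, sum));
    }
}
```

With these values the two replica functions carry about 99% of the total 
weight, which leaves roughly 1% for every other cost function combined.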

> StochasticLoadBalancer may incorrectly skip balancing due to skewed 
> multiplier sum
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-17565
>                 URL: https://issues.apache.org/jira/browse/HBASE-17565
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: 17565.v1.txt, 17565.v2.txt, 17565.v3.txt
>
>
> I was investigating why a 6 node cluster kept skipping balancing requests.
> Here were the region counts on the servers:
> 449, 448, 447, 449, 453, 0
> {code}
> 2017-01-26 22:04:47,145 INFO  [RpcServer.deafult.FPBQ.Fifo.handler=1,queue=0,port=16000] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 127.0171157050385, sum multiplier is 111087.0 min cost which need balance is 0.05
> {code}
> The big multiplier sum caught my eye. Here is what additional debug logging 
> showed:
> {code}
> 2017-01-27 23:25:31,749 DEBUG [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] balancer.StochasticLoadBalancer: class org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaHostCostFunction with multiplier 100000.0
> 2017-01-27 23:25:31,749 DEBUG [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] balancer.StochasticLoadBalancer: class org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaRackCostFunction with multiplier 10000.0
> {code}
> Note, however, that no table in the cluster used read replicas.
> I can think of two ways of fixing this situation:
> 1. If there is no read replica in the cluster, ignore the multipliers for the 
> above two functions.
> 2. When the cost() returned by the CostFunction is 0 (or very close to 0.0), 
> ignore the multiplier.
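
The second option can be sketched as follows. This is a minimal illustration 
under stated assumptions, not the attached patch: it assumes the skip decision 
compares total cost divided by the multiplier sum against the 0.05 threshold, 
as the INFO log suggests. The class and method names are hypothetical, and all 
cost functions other than the two replica ones are collapsed into a single 
stand-in entry with multiplier 1087.0 (111087.0 - 100000.0 - 10000.0) that 
accounts for the whole logged total cost.

```java
public class SkewedMultiplierSketch {
    static final double EPSILON = 0.000000000000001D;

    /** Sums multipliers, optionally skipping functions whose cost is ~0. */
    public static double sumMultipliers(double[] costs, double[] multipliers,
                                        boolean skipZeroCost) {
        double sum = 0.0;
        for (int i = 0; i < costs.length; i++) {
            if (skipZeroCost && costs[i] < EPSILON) {
                continue; // option 2: a zero cost contributes no weight
            }
            sum += multipliers[i];
        }
        return sum;
    }

    /** Weighted total: each cost in [0, 1] scaled by its multiplier. */
    public static double totalCost(double[] costs, double[] multipliers) {
        double total = 0.0;
        for (int i = 0; i < costs.length; i++) {
            total += costs[i] * multipliers[i];
        }
        return total;
    }

    /** Assumed decision rule: balance when the normalized cost reaches minCost. */
    public static boolean needsBalance(double total, double sumMultiplier,
                                       double minCost) {
        return sumMultiplier > 0 && total / sumMultiplier >= minCost;
    }

    public static void main(String[] args) {
        // Replica host, replica rack, and one stand-in for all other functions.
        double[] multipliers = {100000.0, 10000.0, 1087.0};
        // No read replicas in the cluster, so both replica costs are 0.
        double[] costs = {0.0, 0.0, 127.0171157050385 / 1087.0};

        double total = totalCost(costs, multipliers);
        System.out.println(
            needsBalance(total, sumMultipliers(costs, multipliers, false), 0.05));
        System.out.println(
            needsBalance(total, sumMultipliers(costs, multipliers, true), 0.05));
    }
}
```

With all multipliers counted, the normalized cost is about 0.0011 and 
balancing is skipped. With the zero-cost replica functions left out of the 
sum, it is about 0.117 and the cluster is considered in need of balancing.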



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
