[ 
https://issues.apache.org/jira/browse/HELIX-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223719#comment-14223719
 ] 

Zhen Zhang commented on HELIX-541:
----------------------------------

Here is the problem.
- At time t1: Node_0 is selected as LEADER for partition_0, and we have a 
pending message, say STANDBY->LEADER for (partition_0, Node_0).

- At time t2 (>t1): Node_0 finishes the STANDBY->LEADER transition but hasn't 
deleted the message yet. Meanwhile, the controller selects another LEADER for 
partition_0, say Node_1. Since Node_0's current state is LEADER and we only 
consider the toState of the pending message, the controller thinks it's OK to 
send OFFLINE->STANDBY to Node_1.

- At time t3 (>t2): Node_0 is in LEADER and Node_1 is in STANDBY, and the 
controller now enters the livelock: it can send neither STANDBY->LEADER to 
Node_1 nor LEADER->STANDBY to Node_0.

The fix is to consider both the fromState and the toState of a pending message, 
so the controller will not send OFFLINE->STANDBY to Node_1 while Node_0 hasn't 
removed the STANDBY->LEADER message. In the next Helix pipeline run, since the 
transition priority is LEADER->STANDBY > OFFLINE->STANDBY, the controller will 
bring Node_0 to STANDBY and then to OFFLINE first, avoiding the livelock.
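
A minimal sketch of the counting idea (class and method names here are 
hypothetical, not the actual Helix controller code): a replica with a pending 
message is treated as occupying both its fromState and its toState, so a 
completed-but-not-yet-deleted STANDBY->LEADER message still marks partition_0 
as having an in-flight transition.
{noformat}
import java.util.HashSet;
import java.util.Set;

// Hedged sketch of the fix; does not mirror the actual Helix controller code.
public class PendingStateCheck {

    /** A pending state-transition message for one replica of a partition. */
    static class PendingMessage {
        final String fromState;
        final String toState;
        PendingMessage(String fromState, String toState) {
            this.fromState = fromState;
            this.toState = toState;
        }
    }

    /**
     * States this replica may occupy while the pending message exists.
     * Old behavior: only pending.toState was considered, so a finished but
     * not-yet-deleted STANDBY->LEADER message looked identical to a settled
     * LEADER replica. Fix: count both endpoints, so the controller can see
     * the replica is still in flight and defer further moves.
     */
    static Set<String> occupiedStates(String currentState, PendingMessage pending) {
        Set<String> states = new HashSet<>();
        states.add(currentState);
        if (pending != null) {
            states.add(pending.fromState);
            states.add(pending.toState);
        }
        return states;
    }

    public static void main(String[] args) {
        // Time t2 from above: Node_0 finished STANDBY->LEADER but has not
        // deleted the message yet.
        Set<String> node0 = occupiedStates("LEADER",
            new PendingMessage("STANDBY", "LEADER"));
        System.out.println(node0); // e.g. [STANDBY, LEADER]: partition_0 is not
                                   // settled, so hold OFFLINE->STANDBY for
                                   // Node_1 until the message is removed.
    }
}
{noformat}
With only toState considered, the occupied set at t2 collapses to [LEADER], 
which is indistinguishable from a settled replica; counting both endpoints is 
what lets the controller hold off on Node_1.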

> Possible "livelock" in Helix controller
> ---------------------------------------
>
>                 Key: HELIX-541
>                 URL: https://issues.apache.org/jira/browse/HELIX-541
>             Project: Apache Helix
>          Issue Type: Bug
>            Reporter: Zhen Zhang
>
> We discovered a "livelock" bug in the Helix controller. This leads to 
> HELIX-540. Assume we have 3 partitions and 2 nodes, using the LeaderStandby 
> state model in FULL_AUTO mode with a replica count of 2. When the mapping 
> stabilizes, we might have the following:
> {noformat}
> partition_0:
>   node_0: LEADER
>   node_1: STANDBY
> partition_1:
>   node_1: LEADER
>   node_0: STANDBY
> partition_2:
>   node_0: LEADER
>   node_1: STANDBY
> {noformat}
> Later we add a new node (node_2) to the cluster, and the rebalancer decides 
> that node_2 should become LEADER for partition_2. So the controller first 
> sends an OFFLINE->STANDBY transition to node_2, and the mapping becomes:
> {noformat}
> partition_0:
>   node_0: LEADER
>   node_1: STANDBY
> partition_1:
>   node_1: LEADER
>   node_0: STANDBY
> partition_2:
>   node_0: LEADER
>   node_1: STANDBY
>   node_2: STANDBY
> {noformat}
> Note that given the LEADER state count is 1 and the STANDBY state count is R, 
> where R=2, this implies the following state constraints:
> {noformat}
> LEADER: upper_bound=1
> STANDBY: upper_bound=2
> {noformat}
> Helix controller now enters the "livelock": it can't send STANDBY->LEADER to 
> node_2, since that would violate the LEADER upper bound; it can't send 
> LEADER->STANDBY to node_0 either, since that would violate the STANDBY upper 
> bound.
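> To make this concrete, a minimal sketch (hypothetical code, not Helix itself) 
> of the greedy one-step bound check the controller effectively performs:
> {noformat}
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical sketch of the one-step bound check; not actual Helix code.
> public class BoundCheck {
>     static final Map<String, Integer> UPPER_BOUND = Map.of("LEADER", 1, "STANDBY", 2);
>
>     /** Would adding one replica in targetState exceed that state's upper bound? */
>     static boolean wouldViolateBound(Map<String, Integer> counts, String targetState) {
>         Integer bound = UPPER_BOUND.get(targetState);
>         return bound != null && counts.getOrDefault(targetState, 0) + 1 > bound;
>     }
>
>     public static void main(String[] args) {
>         // partition_2 after node_2 joined: node_0=LEADER, node_1=STANDBY, node_2=STANDBY
>         Map<String, Integer> counts = new HashMap<>();
>         counts.put("LEADER", 1);
>         counts.put("STANDBY", 2);
>
>         // STANDBY->LEADER on node_2 would make 2 LEADERs (> 1): rejected.
>         System.out.println(wouldViolateBound(counts, "LEADER"));   // true
>         // LEADER->STANDBY on node_0 would make 3 STANDBYs (> 2): rejected.
>         System.out.println(wouldViolateBound(counts, "STANDBY"));  // true
>         // Neither transition can be sent, hence the "livelock".
>     }
> }
> {noformat}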
> We can solve the problem in several ways:
> 1) The state count definition is ambiguous. For some states, like LEADER, 
> when we say state_count=1, that means the constraint can't be violated at any 
> time. However, for other states, like STANDBY, when we say state_count=R, 
> that means the stable mapping should have R-1 STANDBY replicas, but we don't 
> care about the count in transient states. In this case, we can set the 
> STANDBY upper_bound to be larger than 2 (see the state model sketch after 
> this list). Note that this doesn't solve the problem in general: we may have 
> a state model with a strict requirement on the STANDBY state.
> 2) Using state transition priority. If we define the LEADER->STANDBY 
> transition to have a higher priority than the OFFLINE->STANDBY transition, it 
> will resolve this "livelock" (see the state model sketch after this list). 
> But this doesn't solve the problem in general either, because when the state 
> model gets complicated, it's hard to define and prove transition priorities 
> that avoid "livelock" in every situation.
> 3) The root cause of the problem is that the Helix controller uses a greedy 
> algorithm that only looks one step ahead. In this example, if the controller 
> could look two steps further, it would find that sending the OFFLINE->STANDBY 
> transition to node_2 leads to a dead end, and it should therefore send 
> LEADER->STANDBY to node_0 instead. In general we might need to do a DFS/BFS 
> over possible mappings (a lookahead sketch follows at the end of this 
> description), which is hard if the state model is complicated and the system 
> is large.
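> Workarounds 1 and 2 can both be expressed in a custom state model definition. 
> A sketch, assuming the StateModelDefinition.Builder API shown in the Helix 
> tutorial (addState/initialState/addTransition/upperBound); priority semantics 
> and exact signatures may differ between Helix versions:
> {noformat}
> import org.apache.helix.model.StateModelDefinition;
>
> public class RelaxedLeaderStandby {
>     public static StateModelDefinition build() {
>         StateModelDefinition.Builder builder =
>             new StateModelDefinition.Builder("RelaxedLeaderStandby");
>
>         builder.addState("LEADER", 1);
>         builder.addState("STANDBY", 2);
>         builder.addState("OFFLINE", 3);
>         builder.addState("DROPPED", 4);
>         builder.initialState("OFFLINE");
>
>         // Workaround 1: keep LEADER strict, relax STANDBY beyond R=2 so a
>         // transient extra STANDBY replica does not block transitions.
>         builder.upperBound("LEADER", 1);
>         builder.upperBound("STANDBY", 3);
>
>         // Workaround 2: give LEADER->STANDBY a higher priority than
>         // OFFLINE->STANDBY (assuming a lower number means higher priority).
>         builder.addTransition("LEADER", "STANDBY", 1);
>         builder.addTransition("STANDBY", "LEADER", 2);
>         builder.addTransition("OFFLINE", "STANDBY", 3);
>         builder.addTransition("STANDBY", "OFFLINE", 4);
>         builder.addTransition("OFFLINE", "DROPPED", 5);
>
>         return builder.build();
>     }
> }
> {noformat}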
> In practice, most systems use simple state models with fewer than 5 states 
> and a strict state constraint on a single state (e.g. MASTER, LEADER) only. 
> We can avoid the "livelock" by carefully choosing state constraints.
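> To make workaround 3 concrete, here is a generic lookahead sketch 
> (hypothetical code, not part of Helix): it only considers transitions that 
> move a replica toward its target state, as the greedy controller does, and 
> runs a BFS to check whether the target mapping is still reachable after a 
> candidate first move. The target mapping below (node_0 back to OFFLINE) is an 
> assumption for illustration.
> {noformat}
> import java.util.*;
>
> public class LookaheadSketch {
>
>     static final List<String> PATH = List.of("OFFLINE", "STANDBY", "LEADER");
>     static final Map<String, Integer> UPPER_BOUND = Map.of("LEADER", 1, "STANDBY", 2);
>
>     static boolean respectsBounds(Map<String, String> mapping) {
>         Map<String, Integer> counts = new HashMap<>();
>         mapping.values().forEach(s -> counts.merge(s, 1, Integer::sum));
>         return counts.entrySet().stream().allMatch(
>             e -> e.getValue() <= UPPER_BOUND.getOrDefault(e.getKey(), Integer.MAX_VALUE));
>     }
>
>     /** One step along OFFLINE<->STANDBY<->LEADER toward the target state. */
>     static String stepToward(String current, String target) {
>         int c = PATH.indexOf(current), t = PATH.indexOf(target);
>         return c == t ? current : PATH.get(c < t ? c + 1 : c - 1);
>     }
>
>     /** Mappings reachable by one bound-respecting, toward-target transition. */
>     static List<Map<String, String>> nextMappings(Map<String, String> current,
>                                                   Map<String, String> target) {
>         List<Map<String, String>> result = new ArrayList<>();
>         for (String instance : current.keySet()) {
>             String next = stepToward(current.get(instance), target.get(instance));
>             if (!next.equals(current.get(instance))) {
>                 Map<String, String> candidate = new HashMap<>(current);
>                 candidate.put(instance, next);
>                 if (respectsBounds(candidate)) {
>                     result.add(candidate);
>                 }
>             }
>         }
>         return result;
>     }
>
>     /** BFS: is the target mapping still reachable from this mapping? */
>     static boolean reachable(Map<String, String> start, Map<String, String> target) {
>         Deque<Map<String, String>> queue = new ArrayDeque<>();
>         Set<Map<String, String>> seen = new HashSet<>();
>         queue.add(start);
>         seen.add(start);
>         while (!queue.isEmpty()) {
>             Map<String, String> m = queue.poll();
>             if (m.equals(target)) {
>                 return true;
>             }
>             for (Map<String, String> n : nextMappings(m, target)) {
>                 if (seen.add(n)) {
>                     queue.add(n);
>                 }
>             }
>         }
>         return false;
>     }
>
>     public static void main(String[] args) {
>         // partition_2 right after node_2 joins; assumed target: node_2 LEADER,
>         // node_1 STANDBY, node_0 back to OFFLINE.
>         Map<String, String> current =
>             Map.of("node_0", "LEADER", "node_1", "STANDBY", "node_2", "OFFLINE");
>         Map<String, String> target =
>             Map.of("node_0", "OFFLINE", "node_1", "STANDBY", "node_2", "LEADER");
>
>         for (Map<String, String> firstMove : nextMappings(current, target)) {
>             System.out.println(firstMove + " -> target reachable: "
>                 + reachable(firstMove, target));
>         }
>     }
> }
> {noformat}
> With these bounds, the search reports that LEADER->STANDBY on node_0 keeps the 
> target reachable, while OFFLINE->STANDBY on node_2 does not.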



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
