[ https://issues.apache.org/jira/browse/HELIX-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223719#comment-14223719 ]
Zhen Zhang commented on HELIX-541: ---------------------------------- Here is the problem. - At time t1: Node_0 is selected LEADER for partition_0, and we have a pending message, say STANDBY->LEADER for (partition_0, Node_0). - At time t2 (>t1): Node_0 finishes STANDBY->LEADER transition but hasn't deleted the message. Meantime, controller selects another LEADER for partition_0, say Node_1. Since Node_0's current state is LEADER and we are only considering toState in pending message, controller thinks it's OK to send OFFLINE->STANDBY t0 Node_1. - At time t3 (>t2): Node_0 in LEADER and Node_1 in STANDBY, controller now enters into the livelock where neither can it send STANDBY->LEADERl to Node_1, nor it can send LEADER->STANDBY to Node_0. The fix it to consider both fromState and toState in pending message, so controller will not send OFFLINE->STANDBY to Node_1 if Node_0 hasn't remove STANDBY->LEADER message. For the next Helix pipeline, since we have transition priority that LEADER->STANDBY > OFFLINE->STANDBY, controller will bring Node_0 to STANDBY then to OFFLINE first, avoiding the livelock. > Possible "livelock" in Helix controller > --------------------------------------- > > Key: HELIX-541 > URL: https://issues.apache.org/jira/browse/HELIX-541 > Project: Apache Helix > Issue Type: Bug > Reporter: Zhen Zhang > > We discover a "livelock" bug in Helix controller. This leads to HELIX-540. > Assuming we have 3 partitions and 2 nodes, using LeaderStandby state model, > FULL_AUTO mode, and replica is 2. When reaching stable mapping, we might have > the following: > {noformat} > partition_0: > node_0: LEADER > node_1: STANDBY > partition_1: > node_1: LEADER > node_0: STANDBY > partition_2: > node_0: LEADER > node_1: STANDBY > {noformat} > Later we add a new node (node_2) to the cluster and rebalancer decides that > node_2 should become LEADER for partition_2. So controller first sends > OFFLINE->STANDBY transition to node_2, and the mapping becomes: > {noformat} > partition_0: > node_0: LEADER > node_1: STANDBY > partition_1: > node_1: LEADER > node_0: STANDBY > partition_2: > node_0: LEADER > node_1: STANDBY > node_2: STANDBY > {noformat} > Note that given LEADER state count is 1 and STANDBY state count is R, where > R=2, it is implying the following state constraints: > {noformat} > LEADER: upper_bound=1 > STANDBY: upper_bound=2 > {noformat} > Helix controller now enters the "livelock": it can't send STANDBY->LEADER to > node_2, since this will violate LEADER upper bound; it can't send > LEADER->STANDBY to node_0 either, since will violate STANDBY upper bound. > We can solve the problem in several ways: > 1) State count definition is ambiguous. For some state, like LEADER, when we > say state_count=1, that means we can't violate this constraint at any time. > However, for some other state, like STANDBY, when we say state_count=R, that > means in stable mapping, there should be R-1 STANDBY replicas, but we don't > care the count in any transient state. In this case, we can set STANDBY > upper_bound to be larger than 2. Note that this doesn't solve the problem in > general. We may have some state model that has a restrict requirement on > STANDBY state. > 2) Using state transition priority. If we define LEADER->STANDBY transition > should have a higher priority than OFFLINE->STANDBY transition, it will solve > the "livelock". But this doesn't solve the problem in general either, because > when the state model gets complicated, it's hard to define and prove proper > transition priorities that avoid "livelock" in any situation. > 3) The root cause of the problem is that Helix controller uses a greedy > algorithm that only looks one step ahead. In the example, if Helix controller > can look two steps further, then it will find out that sending > OFFLINE->STANDBY transition to node_2 will lead to a dead end, therefore it > should choose to send LEADER->STANDBY to node_0 instead. In general we might > need to do a DFS/BFS and it's hard if state model is complicated and system > is large. > In practice, most systems use simple state models with less than 5 states and > have strict state constraint on a single state (e.g MASTER, LEADER) only. We > can avoid "livelock" by carefully choosing state constraints. -- This message was sent by Atlassian JIRA (v6.3.4#6332)