[ https://issues.apache.org/jira/browse/HELIX-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221312#comment-14221312 ]
Zhen Zhang commented on HELIX-541: ---------------------------------- because the new ideal-state is: {noformat} node_2: LEADER node_1: STANDBY {noformat} and the current-state is: {noformat} node_0: LEADER node_1: STANDBY {noformat} controller figures out it has to send two transitions: {noformat} To node_0: LEADER->STANDBY To node_2: OFFLINE->STANDBY {noformat} node_1 remains in STANDBY so controller will not send any transition to it. > Possible "livelock" in Helix controller > --------------------------------------- > > Key: HELIX-541 > URL: https://issues.apache.org/jira/browse/HELIX-541 > Project: Apache Helix > Issue Type: Bug > Reporter: Zhen Zhang > > We discover a "livelock" bug in Helix controller. This leads to HELIX-540. > Assuming we have 3 partitions and 2 nodes, using LeaderStandby state model, > FULL_AUTO mode, and replica is 2. When reaching stable mapping, we might have > the following: > {noformat} > partition_0: > node_0: LEADER > node_1: STANDBY > partition_1: > node_1: LEADER > node_0: STANDBY > partition_2: > node_0: LEADER > node_1: STANDBY > {noformat} > Later we add a new node (node_2) to the cluster and rebalancer decides that > node_2 should become LEADER for partition_2. So controller first sends > OFFLINE->STANDBY transition to node_2, and the mapping becomes: > {noformat} > partition_0: > node_0: LEADER > node_1: STANDBY > partition_1: > node_1: LEADER > node_0: STANDBY > partition_2: > node_0: LEADER > node_1: STANDBY > node_2: STANDBY > {noformat} > Note that given LEADER state count is 1 and STANDBY state count is R, where > R=2, it is implying the following state constraints: > {noformat} > LEADER: upper_bound=1 > STANDBY: upper_bound=2 > {noformat} > Helix controller now enters the "livelock": it can't send STANDBY->LEADER to > node_2, since this will violate LEADER upper bound; it can't send > LEADER->STANDBY to node_0 either, since will violate STANDBY upper bound. > We can solve the problem in several ways: > 1) State count definition is ambiguous. For some state, like LEADER, when we > say state_count=1, that means we can't violate this constraint at any time. > However, for some other state, like STANDBY, when we say state_count=R, that > means in stable mapping, there should be R-1 STANDBY replicas, but we don't > care the count in any transient state. In this case, we can set STANDBY > upper_bound to be larger than 2. Note that this doesn't solve the problem in > general. We may have some state model that has a restrict requirement on > STANDBY state. > 2) Using state transition priority. If we define LEADER->STANDBY transition > should have a higher priority than OFFLINE->STANDBY transition, it will solve > the "livelock". But this doesn't solve the problem in general either, because > when the state model gets complicated, it's hard to define and prove proper > transition priorities that avoid "livelock" in any situation. > 3) The root cause of the problem is that Helix controller uses a greedy > algorithm that only looks one step ahead. In the example, if Helix controller > can look two steps further, then it will find out that sending > OFFLINE->STANDBY transition to node_2 will lead to a dead end, therefore it > should choose to send LEADER->STANDBY to node_0 instead. In general we might > need to do a DFS/BFS and it's hard if state model is complicated and system > is large. > In practice, most systems use simple state models with less than 5 states and > have strict state constraint on a single state (e.g MASTER, LEADER) only. We > can avoid "livelock" by carefully choosing state constraints. -- This message was sent by Atlassian JIRA (v6.3.4#6332)