[ https://issues.apache.org/jira/browse/IGNITE-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mirza Aliev updated IGNITE-17056:
---------------------------------
Description:
There are cases when the current leader cannot perform a rebalance on the specified set of nodes, for example, when some node from the Raft group permanently fails with {{RaftError#ECATCHUP}}. A retry mechanism for such scenarios was implemented in IGNITE-16801, but we cannot retry a rebalance intent infinitely, so a mechanism for cancelling a rebalance should be implemented.

Naive cancelling could be implemented by removing the {{pending key}} and replacing it with the {{planned key}}. But this approach has several crucial limitations and may cause inconsistency in the current rebalance protocol, for example, when there is a race between the cancel and the new leader applying a new assignment to the {{stable key}}. If we remove the {{pending key}} right before the new assignment is applied to the {{stable key}}, we can no longer resolve peers to ClusterIds, because that resolution is performed on the union of the pending and stable keys.

There is also a case where we can lose a planned rebalance:
# The current leader retries a failed rebalance.
# The current leader stops being the leader for some reason and sleeps.
# A new leader performs the rebalance and calls {{RebalanceRaftGroupEventsListener#onNewPeersConfigurationApplied}}.
# At this moment the old leader wakes up and cancels the current rebalance: it removes the {{pending key}} and writes the {{planned key}} value into it.
# We then receive {{RebalanceRaftGroupEventsListener#onNewPeersConfigurationApplied}} from the new leader, see that the planned key is empty, and simply delete the {{pending key}}. But deleting this key is not correct, because the rebalance associated with the removed key has not been performed yet.

We should also consider separating the scenarios for recoverable and unrecoverable errors, because retrying a rebalance may be useless if some participating node fails with an unrecoverable error. We should think carefully about introducing proper failure handling for such exceptional scenarios.
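The lost-planned-rebalance race above can be sketched in a few lines. This is hypothetical illustration code, not the real Ignite metastorage API: the pending/planned/stable keys are modelled with a plain in-memory map, and the method names merely echo the scenario in the description.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the race described in IGNITE-17056: a naive cancel
 * (pending := planned) combined with the new leader's configuration-applied
 * callback silently drops a planned rebalance. The metastorage is modelled
 * as a plain map keyed by "&lt;partition&gt;.pending|planned|stable".
 */
public class NaiveCancelRace {
    static final Map<String, String> metastorage = new HashMap<>();

    /** Naive cancel on the old leader: drop pending, promote planned into it. */
    static void naiveCancel(String part) {
        String planned = metastorage.remove(part + ".planned");
        if (planned != null) {
            metastorage.put(part + ".pending", planned);
        } else {
            metastorage.remove(part + ".pending");
        }
    }

    /** New leader's callback: move the applied assignment to stable, then
     *  promote planned if present, otherwise just delete the pending key. */
    static void onNewPeersConfigurationApplied(String part, String applied) {
        metastorage.put(part + ".stable", applied);
        String planned = metastorage.remove(part + ".planned");
        if (planned != null) {
            metastorage.put(part + ".pending", planned);
        } else {
            // This is the bug: the pending key may now hold the promoted
            // planned assignment, whose rebalance has never been performed.
            metastorage.remove(part + ".pending");
        }
    }

    public static void main(String[] args) {
        // A rebalance to [A, B] is pending (and failing); [A, B, C] is planned next.
        metastorage.put("p0.pending", "[A, B]");
        metastorage.put("p0.planned", "[A, B, C]");

        // Old leader wakes up and cancels: pending becomes [A, B, C], planned is empty.
        naiveCancel("p0");

        // New leader had meanwhile finished the [A, B] rebalance and reports it:
        // planned is empty, so the [A, B, C] intent in pending is deleted and lost.
        onNewPeersConfigurationApplied("p0", "[A, B]");

        System.out.println("stable  = " + metastorage.get("p0.stable"));   // [A, B]
        System.out.println("pending = " + metastorage.get("p0.pending")); // null
    }
}
```

A correct cancel would have to be a conditional (compare-and-swap style) metastorage update that the configuration-applied path can distinguish from a genuinely completed rebalance, rather than an unconditional key swap.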
> Implement rebalance cancel mechanism
> ------------------------------------
>
>                 Key: IGNITE-17056
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17056
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>
> --
> This message was sent by Atlassian Jira
> (v8.20.7#820007)