[ 
https://issues.apache.org/jira/browse/IGNITE-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mirza Aliev updated IGNITE-17056:
---------------------------------
    Epic Link: IGNITE-14209

> Implement rebalance cancel mechanism
> ------------------------------------
>
>                 Key: IGNITE-17056
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17056
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>
> There are cases when the current leader cannot perform a rebalance on the 
> specified set of nodes, for example, when some node from the raft group 
> permanently fails with {{RaftError#ECATCHUP}}. A retry mechanism for this 
> scenario is implemented in IGNITE-16801, but we cannot retry a rebalance 
> intent indefinitely, so a mechanism for canceling a rebalance should be 
> implemented. 
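
The "retry, but not forever" idea above can be sketched as a bounded retry that falls back to a cancel. This is only an illustration: `BoundedRebalanceRetry`, its `Outcome` enum, and the `attempt`/`cancel` callbacks are hypothetical names, not part of the Ignite code base or of IGNITE-16801.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper: retries a rebalance attempt a bounded number of times, then cancels. */
final class BoundedRebalanceRetry {
    enum Outcome { APPLIED, CANCELLED }

    private final int maxAttempts;

    BoundedRebalanceRetry(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    /**
     * @param attempt stand-in for the real rebalance call; returns true on success
     * @param cancel  invoked exactly once if every attempt failed
     */
    Outcome run(BooleanSupplier attempt, Runnable cancel) {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return Outcome.APPLIED;
            }
        }
        // Retry budget exhausted: give up on this rebalance intent.
        cancel.run();
        return Outcome.CANCELLED;
    }
}
```

What `cancel` actually has to do to stay consistent with the rebalance protocol is exactly the open question of this issue; the sketch only shows where it would be triggered.
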
> A naive cancel could be implemented by removing the {{pending key}} and 
> replacing it with the {{planned key}}. But this approach has several crucial 
> limitations and may cause inconsistency in the current rebalance protocol, 
> for example, when there is a race between the cancel and the new leader 
> applying a new assignment to the {{stable key}}. The cancel can remove the 
> {{pending key}} right before the new assignment is applied to the 
> {{stable key}}, in which case we cannot resolve peers to ClusterIds, because 
> that resolution is done on the union of the pending and stable keys. 
> Also, there is a case where we can lose a planned rebalance:
>  # The current leader retries a failed rebalance.
>  # The current leader stops being leader for some reason and sleeps.
>  # The new leader performs the rebalance and calls 
> {{RebalanceRaftGroupEventsListener#onNewPeersConfigurationApplied}}.
>  # At this moment the old leader wakes up and cancels the current rebalance, 
> so it removes the pending key and writes the value of the planned key to it.
>  # At this moment we receive 
> {{RebalanceRaftGroupEventsListener#onNewPeersConfigurationApplied}} from the 
> new leader, see that the planned key is empty, and so just delete the pending 
> key. Deleting it is incorrect, because the rebalance associated with the 
> removed key has not been performed yet.
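
The five steps above can be replayed against a toy in-memory model of the three keys. This is a simplified sketch, not the real meta storage or listener code: `RebalanceKeys`, `naiveCancel`, the string key names, and the peer names A..E are all assumptions made for illustration, and the handler is reduced to the behaviour described in this issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the stable/pending/planned keys (hypothetical, not the Ignite API). */
final class RebalanceKeys {
    private final Map<String, Set<String>> store = new HashMap<>();

    Set<String> get(String key) {
        return store.get(key);
    }

    void put(String key, Set<String> peers) {
        store.put(key, peers);
    }

    /** Naive cancel: drop pending and move the planned assignment (if any) into its place. */
    void naiveCancel() {
        promotePlannedToPending();
    }

    /** Simplified handler: record the applied peers as stable, then promote planned or delete pending. */
    void onNewPeersConfigurationApplied(Set<String> appliedPeers) {
        store.put("stable", appliedPeers);
        promotePlannedToPending();
    }

    private void promotePlannedToPending() {
        Set<String> planned = store.remove("planned");
        if (planned == null) {
            store.remove("pending");
        } else {
            store.put("pending", planned);
        }
    }

    /** Replays the scenario from the list above and returns the resulting key state. */
    static RebalanceKeys replayLostPlannedRebalance() {
        RebalanceKeys keys = new RebalanceKeys();
        keys.put("stable", Set.of("A", "B", "C"));
        keys.put("pending", Set.of("A", "B", "D")); // rebalance being retried by the old leader
        keys.put("planned", Set.of("A", "B", "E")); // next rebalance, queued behind it

        // Steps 1-3: the old leader sleeps, the new leader completes the rebalance to {A, B, D}.
        // Step 4: the old leader wakes up and cancels: pending := planned, planned is now empty.
        keys.naiveCancel();
        // Step 5: the event from the new leader arrives; planned is empty, so pending is deleted.
        keys.onNewPeersConfigurationApplied(Set.of("A", "B", "D"));
        return keys;
    }
}
```

After the replay, both the pending and planned keys are gone while the stable key holds {A, B, D}, so the queued rebalance to {A, B, E} is never performed.
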
> Also, we should consider separating the scenarios for recoverable and 
> unrecoverable errors, because retrying a rebalance might be useless if some 
> participating node fails with an unrecoverable error. 
> It seems we should think carefully about introducing failure handling 
> for such exceptional scenarios. 
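
One possible shape for the recoverable/unrecoverable split is a small decision function. This is a sketch under assumptions: `FailureKind`, `RebalanceAction`, and `RebalanceFailurePolicy` are hypothetical names, and the issue does not say which concrete `RaftError` codes would fall into which group.

```java
/** Hypothetical classification of a rebalance failure. */
enum FailureKind { RECOVERABLE, UNRECOVERABLE }

/** Hypothetical reaction to a failed rebalance attempt. */
enum RebalanceAction { RETRY, CANCEL }

final class RebalanceFailurePolicy {
    private RebalanceFailurePolicy() {
    }

    /**
     * @param kind          how the failure was classified
     * @param attemptsSoFar number of attempts already made, including the failed one
     * @param maxAttempts   retry budget for recoverable failures
     */
    static RebalanceAction decide(FailureKind kind, int attemptsSoFar, int maxAttempts) {
        if (kind == FailureKind.UNRECOVERABLE) {
            // Retrying cannot help, so cancel right away.
            return RebalanceAction.CANCEL;
        }
        // Recoverable: keep retrying until the budget is used up.
        return attemptsSoFar < maxAttempts ? RebalanceAction.RETRY : RebalanceAction.CANCEL;
    }
}
```

For example, `RebalanceFailurePolicy.decide(FailureKind.UNRECOVERABLE, 1, 5)` cancels after the first failure, while a recoverable failure is retried until the fifth attempt has failed.
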
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
