[jira] [Commented] (IGNITE-23566) Investigate possible races between resetPartitions and infinite rebalance retries

Kirill Gusakov (Jira) Thu, 07 Nov 2024 10:06:01 -0800


    [ 
https://issues.apache.org/jira/browse/IGNITE-23566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17896433#comment-17896433
 ]


Kirill Gusakov commented on IGNITE-23566:
-----------------------------------------

From the Raft node's point of view, every rebalance works as follows:
 * It is a transition of ConfigurationCtx.stage through the values STAGE_NONE -> 
STAGE_CATCHING_UP -> STAGE_JOINT -> STAGE_STABLE -> STAGE_NONE
 * While the stage is not STAGE_NONE, every other concurrent configuration 
update receives RaftError.EBUSY
 * On any error we restart the rebalance process again and again, with *the 
current term* and *the target configuration of the current rebalance*
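
To make the stage guard concrete, here is a minimal sketch of it. It is a simplified model, not the actual Ignite or JRaft code: the class ConfigurationCtxSketch, its methods, and the Status enum are hypothetical names introduced only for illustration. Only the stage names and the EBUSY behaviour come from the description above.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Simplified, hypothetical model of the stage guard (not the real ConfigurationCtx). */
final class ConfigurationCtxSketch {
    enum Stage { STAGE_NONE, STAGE_CATCHING_UP, STAGE_JOINT, STAGE_STABLE }

    /** Stand-in for the relevant RaftError values. */
    enum Status { OK, EBUSY }

    private final AtomicReference<Stage> stage = new AtomicReference<>(Stage.STAGE_NONE);

    /** Starts a configuration change only if no other change is in flight. */
    Status start() {
        return stage.compareAndSet(Stage.STAGE_NONE, Stage.STAGE_CATCHING_UP)
                ? Status.OK
                : Status.EBUSY;
    }

    /** Moves the in-flight change to its next stage; STAGE_STABLE goes back to STAGE_NONE. */
    Stage nextStage() {
        return stage.updateAndGet(s -> {
            switch (s) {
                case STAGE_CATCHING_UP: return Stage.STAGE_JOINT;
                case STAGE_JOINT: return Stage.STAGE_STABLE;
                case STAGE_STABLE: return Stage.STAGE_NONE;
                default: throw new IllegalStateException("No configuration change in progress");
            }
        });
    }

    Stage stage() {
        return stage.get();
    }
}
```

In this model a second start() call returns EBUSY until nextStage() has been called three times and the stage is STAGE_NONE again.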

At the same time, any resetPeers operation:
 * Rewrites the pending rebalance assignments
 * Checks that the state of configurationCtx is not busy
 * Rewrites the node configuration
 * Increments the node term
 * Runs the appropriate changePeersAndLearnersAsync 

Despite the various possible races between the pending assignments rewrite, 
the partition reset, onReconfigurationError, and the 
changePeersAndLearnersAsync call issued on error:
 * onReconfigurationError uses the current term, and resetPeers increments 
the term on a successful reset. Together this saves us from stale, infinite 
onReconfigurationError listeners. Moreover, we stop the 
onReconfigurationError cycle if we receive an OK status after a request with 
an expired term
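
The term-based protection can be sketched as follows. This is only an illustrative model under stated assumptions: TermGuardSketch and its methods are hypothetical, and the staleness check is done locally here, whereas in the real code it may well happen on the Raft side when the request with the expired term is processed.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical model of the term guard; names are illustrative, not Ignite's API. */
final class TermGuardSketch {
    private final AtomicLong term = new AtomicLong(1);

    long currentTerm() {
        return term.get();
    }

    /** Models a successful resetPeers: bumping the term invalidates older retry chains. */
    long resetPeers() {
        return term.incrementAndGet();
    }

    /**
     * Models one onReconfigurationError retry that captured {@code capturedTerm}
     * when its rebalance started. Returns true if the retry is still relevant,
     * false if a reset has happened since and the retry cycle should stop.
     */
    boolean shouldRetry(long capturedTerm) {
        return capturedTerm == term.get();
    }
}
```

A retry chain that captured the term before resetPeers() sees shouldRetry(...) return false afterwards, so it cannot keep pushing the old target configuration forever.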

However, during the investigation I found other issues and created tickets 
for them:
 * If a changePeersAndLearnersAsync request fails and sendWithRetry can't fix 
it, we fail the whole metastore notification future, but we should retry 
instead https://issues.apache.org/jira/browse/IGNITE-23633
 * We have a bug: an error in resetPeers is not handled in any way. Currently 
this is a problem for the manual reset 
https://issues.apache.org/jira/browse/IGNITE-23635

Nice to have:
 * A configurable delay between onReconfigurationError retries 
https://issues.apache.org/jira/browse/IGNITE-23634

> Investigate possible races between resetPartitions and infinite rebalance 
> retries
> ---------------------------------------------------------------------------------
>
>                 Key: IGNITE-23566
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23566
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Kirill Gusakov
>            Assignee: Kirill Gusakov
>            Priority: Major
>              Labels: ignite-3
>
> *Motivation*
> For now our rebalance fail-over is a pretty trivial infinite loop of retries:
> - on any issue in the catch-up phase or later, we call the 
> onReconfigurationError listener
> - for now this listener just counts the retries and calls the 
> changePeersAndLearnersAsync logic again and again
> At the same time, we can call the resetPartitions logic and rewrite pending 
> assignments, potentially at any moment. So, we can have a race between 
> rebalance retries and resetPartitions.
> *Definition of done*
> Under this ticket we need to investigate all possible issues, if any, and 
> create appropriate issues to resolve.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
