[ 
https://issues.apache.org/jira/browse/IGNITE-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-16668:
-------------------------------------
    Attachment: Screenshot from 2022-04-19 11-12-55-1.png

> Raft group reconfiguration on node failure
> ------------------------------------------
>
>                 Key: IGNITE-16668
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16668
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Sergey Chugunov
>            Assignee: Alexander Lapin
>            Priority: Major
>              Labels: ignite-3
>         Attachments: Screenshot from 2022-04-19 11-11-05.png, Screenshot from 
> 2022-04-19 11-12-55-1.png, Screenshot from 2022-04-19 11-12-55.png
>
>
> If a node storing a partition of an in-memory table fails and leaves the 
> cluster, all data it had is lost. From the point of view of the partition, it 
> looks as if the node has left forever.
> Although the Raft protocol tolerates losing some of the nodes composing a Raft 
> group (partition), for in-memory caches we cannot restore the replica factor 
> because of the in-memory nature of the table.
> This means that we need to detect the failure of each node owning a partition 
> and recalculate assignments for the table without keeping the replica factor.
> h4. Upd 1:
> h4. Problem
> By design, Raft has several persisted segments, e.g. the Raft meta 
> (term/committedIndex) and the stable Raft log. So, by converting common Raft 
> to an in-memory one, it's possible to break some of its invariants. For 
> example, node C could vote for candidate A before a self-restart and then vote 
> for candidate B after it. As a result, two leaders would be elected, which is 
> illegal.
>  
> !Screenshot from 2022-04-19 11-11-05.png!
>  
> h4. Solution
> In order to solve the problem mentioned above, it's possible to remove the 
> restarting node from the peers of the corresponding Raft group and then add it 
> back. The peer removal must be finished before the corresponding Raft server 
> node is restarted.
>  
> The process of removing and then returning the restarting node is, however, 
> itself tricky. To explain why it's a non-trivial action, it's necessary to 
> outline the main ideas of the rebalance protocol.
> Reconfiguration of a Raft group is a process driven by a change of 
> assignments. Each partition has three corresponding sets of assignments stored 
> in the metastore:
>  # assignments.stable - the current distribution
>  # assignments.pending - the partition distribution for an ongoing rebalance, 
> if any
>  # assignments.planned - in some cases it's not possible to cancel or merge a 
> pending rebalance with a new one. In that case the newly calculated 
> assignments are stored explicitly under the assignments.planned key. It's 
> worth noting that it doesn't make sense to keep more than one planned 
> rebalance: any newly scheduled one overwrites the existing one (see the sketch 
> after this list).
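> A minimal Java sketch of this model (the class and field names are 
> illustrative assumptions; the real sets live in the metastore under 
> per-partition keys and are updated via atomic invokes):
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
> 
> // Hypothetical in-memory model of the three per-partition assignment sets.
> final class PartitionAssignments {
>     Set<String> stable = new HashSet<>();  // current distribution
>     Set<String> pending = new HashSet<>(); // ongoing rebalance target, empty if none
>     Set<String> planned = new HashSet<>(); // next rebalance target; at most one is kept
> 
>     // A newly calculated rebalance either becomes pending or overwrites the planned one.
>     void schedule(Set<String> newAssignments) {
>         if (pending.isEmpty()) {
>             pending = newAssignments;
>         } else {
>             planned = newAssignments; // overwrites whatever was planned before
>         }
>     }
> }
> {code}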
> However, this idea of overwriting the assignments.planned key won't work in 
> the context of an in-memory Raft restart, because it's not valid to overwrite 
> the reduction of assignments. Let's illustrate this problem with the following 
> example.
>  # In-memory partition p1 is hosted on nodes A, B and C, meaning that 
> p1.assignments.stable=[A,B,C]
>  # Let's say that the baseline was changed, resulting in a rebalance on 
> assignments.pending=[A,B,C,{*}D{*}]
>  # During the non-cancelable phase of [A,B,C]->[A,B,C,D], node C fails and 
> comes back, meaning that we should plan both the [A,B,D] and the [A,B,C,D] 
> assignments. Both would have to be recorded in the single assignments.planned 
> key, so [A,B,C,D] would overwrite the reduction [A,B,D] and no actual Raft 
> reconfiguration would take place, which is not acceptable.
> In order to overcome this issue, let's introduce a new key 
> _assignments.switch_ that will hold the nodes that should be removed and then 
> returned, and run the following actions:
> h5. On in-memory partition restart (or on partition start with cleaned-up 
> PDS):
> h6. as-is:
> N/A
> h6. to-be:
>  
> {code:java}
> metastoreInvoke*: // atomic metastore call through the multi-invoke api
>     if empty(partition.assignments.change.trigger.revision) || partition.assignments.change.trigger.revision < event.revision:
>         // Accumulate the restarting node(s) into the switch set.
>         var assignmentsSwitch = union(assignments.switch<oldVal>, assignments.switch<newVal>)
> 
>         if empty(partition.assignments.pending):
>             // No rebalance in flight: immediately schedule the reduction that removes the restarting node(s).
>             partition.assignments.pending = subtract(partition.assignments.stable, assignmentsSwitch)
>             partition.assignments.switch = assignmentsSwitch
>             partition.assignments.change.trigger.revision = event.revision
>         else:
>             // Rebalance in flight: only record the switch; it will be applied on rebalance done.
>             partition.assignments.switch = assignmentsSwitch
>             partition.assignments.change.trigger.revision = event.revision
>     else:
>         skip // stale event, already handled under a newer revision
> {code}
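> For illustration, a hedged Java sketch of the restart-time computation above. 
> Only the value calculation is shown, and in reality it has to be applied 
> atomically via the metastore multi-invoke, guarded by 
> assignments.change.trigger.revision (the class and method names below are 
> assumptions):
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
> 
> // Hypothetical sketch of what the in-memory partition restart writes into the assignment keys.
> final class OnInMemoryRestartSketch {
>     static Set<String> union(Set<String> a, Set<String> b) {
>         Set<String> res = new HashSet<>(a);
>         res.addAll(b);
>         return res;
>     }
> 
>     static Set<String> subtract(Set<String> a, Set<String> b) {
>         Set<String> res = new HashSet<>(a);
>         res.removeAll(b);
>         return res;
>     }
> 
>     /** New value of assignments.switch: the old switch set plus the restarting node. */
>     static Set<String> newSwitch(Set<String> oldSwitch, String restartingNode) {
>         return union(oldSwitch, Set.of(restartingNode));
>     }
> 
>     /**
>      * New value of assignments.pending, or null when a rebalance is already in
>      * flight and the reduction has to wait for "on rebalance done".
>      */
>     static Set<String> newPending(Set<String> stable, Set<String> pending, Set<String> newSwitch) {
>         return pending.isEmpty() ? subtract(stable, newSwitch) : null;
>     }
> }
> {code}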
>  
> h5. On rebalance done
> h6. as-is:
>  
> {code:java}
> metastoreInvoke: // atomic
>     partition.assignments.stable = appliedPeers
>     if empty(partition.assignments.planned):
>         partition.assignments.pending = empty
>     else:
>         partition.assignments.pending = partition.assignments.planned
> {code}
>  
>  
> h6. to-be:
>  
> {code:java}
> metastoreInvoke: // atomic
>     partition.assignments.stable = appliedPeers
> 
>     if !empty(partition.assignments.switch):
>         // Nodes from the switch set that are no longer part of the stable set:
>         // they have already been removed and now have to be returned.
>         var switchStableSubtract = subtract(partition.assignments.switch, partition.assignments.stable)
>         if !empty(switchStableSubtract): // return the removed node(s)
>             partition.assignments.pending = union(partition.assignments.stable, switchStableSubtract)
>             partition.assignments.switch = subtract(partition.assignments.switch, switchStableSubtract)
>             partition.assignments.change.trigger.revision = event.revision
>         else: // the switch node(s) are still in stable: schedule their removal first
>             partition.assignments.pending = subtract(partition.assignments.stable, partition.assignments.switch)
>             partition.assignments.change.trigger.revision = event.revision
>     else:
>         if empty(partition.assignments.planned):
>             partition.assignments.pending = empty
>         else:
>             partition.assignments.pending = partition.assignments.planned
> {code}
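> To make the switch handling above concrete, here is a minimal, self-contained 
> Java sketch of the same set algebra (plain java.util sets; the class and 
> method names are assumptions, and the real logic runs inside an atomic 
> metastore invoke):
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
> 
> // Hypothetical illustration of the "on rebalance done" decision.
> final class OnRebalanceDoneSketch {
>     static Set<String> subtract(Set<String> a, Set<String> b) {
>         Set<String> res = new HashSet<>(a);
>         res.removeAll(b);
>         return res;
>     }
> 
>     static Set<String> union(Set<String> a, Set<String> b) {
>         Set<String> res = new HashSet<>(a);
>         res.addAll(b);
>         return res;
>     }
> 
>     /** Next pending assignments, or null when no further rebalance is needed. */
>     static Set<String> nextPending(Set<String> stable, Set<String> switchSet, Set<String> planned) {
>         if (!switchSet.isEmpty()) {
>             Set<String> toReturn = subtract(switchSet, stable);
> 
>             return toReturn.isEmpty()
>                 ? subtract(stable, switchSet) // switch node(s) still present: remove them first
>                 : union(stable, toReturn);    // switch node(s) already removed: add them back
>         }
> 
>         return planned.isEmpty() ? null : planned;
>     }
> }
> {code}
> With the example above: stable=[A,B,C,D], switch=[C] yields pending=[A,B,D]; 
> once that rebalance finishes, stable=[A,B,D], switch=[C] yields 
> pending=[A,B,C,D], and the switch set becomes empty.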
>  
> It's also possible to update onCommonRebalance in a similar way; however, 
> that's only an optimization, because onRebalanceDone will always treat the 
> switch as a higher-priority action than a planned rebalance triggered by a 
> baseline change or a replica factor change.
> Besides that, in many cases, especially for in-memory partitions, it is worth 
> removing the leaving node from the Raft group peers not during its restart but 
> on the corresponding networkTopology.onDisappeared() event. However, it's only 
> an optimization. E.g. in the case of a persisted table with a cleaned-up PDS, 
> it's not known whether the PDS was cleaned up until the node's self-check on 
> table startup. Thus, the correctness of the algorithm is achieved by the 
> actions taken exclusively when the node is restarted; the rest is optimization.
> All in all, the general flow on node restart will look like the following:
>  
> {code:java}
> private void onPartitionStart(Partition p) {
>     if (p.table().inMemory() || !p.localStorage().isAvailable()) {
>         if (p.isMajorityAvailable()) {
>             <inline metastoreInvoke*> // schedule the remove-then-return switch described above
>             // Do not start the corresponding raft server!
>         } else if (fullPartitionRestart) {
>             commonStart();
>         } else {
>             // Do nothing.
>         }
>     } else {
>         commonStart();
>     }
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
