[ https://issues.apache.org/jira/browse/IGNITE-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Lapin updated IGNITE-16668:
-------------------------------------
Attachment: Screenshot from 2022-04-19 11-12-55-1.png

Raft group reconfiguration on node failure
------------------------------------------

Key: IGNITE-16668
URL: https://issues.apache.org/jira/browse/IGNITE-16668
Project: Ignite
Issue Type: Improvement
Reporter: Sergey Chugunov
Assignee: Alexander Lapin
Priority: Major
Labels: ignite-3
Attachments: Screenshot from 2022-04-19 11-11-05.png, Screenshot from 2022-04-19 11-12-55-1.png, Screenshot from 2022-04-19 11-12-55.png

If a node storing a partition of an in-memory table fails and leaves the cluster, all data it had is lost. From the point of view of the partition it looks as if the node has left forever.

Although the Raft protocol tolerates losing some of the nodes composing a Raft group (partition), for in-memory tables the lost replicas cannot be restored from local storage because of the in-memory nature of the table. It means that we need to detect the failure of each node owning a partition and recalculate the assignments for the table, even if the replica factor cannot be preserved.

h4. Upd 1:
h4. Problem

By design Raft has several persisted segments, e.g. the Raft meta (term/committedIndex) and the stable Raft log. So, by converting common Raft to an in-memory one, it is possible to break some of its invariants. For example, node C could vote for candidate A before a self-restart and then vote for candidate B after it. As a result, two leaders would be elected, which is illegal.

!Screenshot from 2022-04-19 11-11-05.png!

h4. Solution

In order to solve the problem mentioned above, it is possible to remove the restarting node from the peers of the corresponding Raft group and then return it back. The peer removal should be finished before the corresponding Raft server node is restarted.

The process of removing and then returning the restarting node is, however, tricky in itself. To explain why it is a non-trivial action, it is necessary to recall the main ideas of the rebalance protocol.

Reconfiguration of a Raft group is a process driven by a change of assignments. Each partition has three corresponding sets of assignments stored in the metastore:
# assignments.stable - the current distribution
# assignments.pending - the partition distribution of an ongoing rebalance, if any
# assignments.planned - in some cases it is not possible to cancel or merge a pending rebalance with a new one; in that case the newly calculated assignments are stored explicitly under the assignments.planned key. It is worth noting that it does not make sense to keep more than one planned rebalance: any newly scheduled one overwrites the existing one.

However, this idea of overwriting the assignments.planned key won't work in the context of an in-memory Raft restart, because it is not valid to overwrite a reduction of assignments. Let's illustrate the problem with the following example (see the sketch after the list):
# In-memory partition p1 is hosted on nodes A, B and C, meaning that p1.assignments.stable=[A,B,C]
# Let's say that the baseline was changed, resulting in a rebalance on assignments.pending=[A,B,C,{*}D{*}]
# During the non-cancelable phase of [A,B,C]->[A,B,C,D], node C fails and returns back, meaning that we should plan both the [A,B,D] and the [A,B,C,D] assignments. Both would have to be recorded in the single assignments.planned key, so [A,B,C,D] would overwrite the reduction [A,B,D] and no actual Raft reconfiguration would take place, which is not acceptable.
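To make the overwrite problem concrete, here is a minimal plain-Java sketch (not the actual Ignite metastore API; the PartitionAssignments holder is a hypothetical model) showing that a single planned slot cannot hold both scheduled rebalances, so the reduction is silently lost:

{code:java}
import java.util.Set;
import java.util.TreeSet;

public class PlannedKeyOverwriteDemo {
    /** Hypothetical model of the per-partition assignment keys kept in the metastore. */
    static class PartitionAssignments {
        Set<String> stable;   // assignments.stable
        Set<String> pending;  // assignments.pending
        Set<String> planned;  // assignments.planned - a single slot, a new value overwrites the old one
    }

    public static void main(String[] args) {
        PartitionAssignments p1 = new PartitionAssignments();
        p1.stable = Set.of("A", "B", "C");
        p1.pending = Set.of("A", "B", "C", "D"); // non-cancelable rebalance [A,B,C] -> [A,B,C,D]

        // Node C restarts: the reduction [A,B,D] is put into the only planned slot...
        p1.planned = Set.of("A", "B", "D");

        // ...and is then overwritten by the [A,B,C,D] assignments that should follow it.
        p1.planned = Set.of("A", "B", "C", "D");

        // The reduction is gone: no reconfiguration that temporarily excludes C will ever run.
        System.out.println("planned: " + new TreeSet<>(p1.planned)); // planned: [A, B, C, D]
    }
}
{code}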
In order to overcome this issue, let's introduce a new key _assignments.switch_ that will hold the nodes that should be removed and then returned back, and run the following actions:

h5. On in-memory partition restart (or on partition start with a cleaned-up PDS):
h6. as-is:
N/A
h6. to-be:

{code:java}
metastoreInvoke*: // atomic metastore call through the multi-invoke api
    if empty(partition.assignments.change.trigger.revision) || partition.assignments.change.trigger.revision < event.revision:
        var assignmentsSwitch = union(assignments.switch<oldVal>, assignments.switch<newVal>)

        if empty(partition.assignments.pending):
            partition.assignments.pending = subtract(partition.assignments.stable, assignmentsSwitch)
            partition.assignments.switch = assignmentsSwitch
            partition.assignments.change.trigger.revision = event.revision
        else:
            partition.assignments.switch = assignmentsSwitch
            partition.assignments.change.trigger.revision = event.revision
    else:
        skip
{code}

h5. On rebalance done
h6. as-is:

{code:java}
metastoreInvoke: // atomic
    partition.assignments.stable = appliedPeers
    if empty(partition.assignments.planned):
        partition.assignments.pending = empty
    else:
        partition.assignments.pending = partition.assignments.planned
{code}

h6. to-be:

{code:java}
metastoreInvoke: // atomic
    partition.assignments.stable = appliedPeers

    if !empty(partition.assignments.switch):
        // Switch nodes that are no longer part of the stable set still have to be returned.
        var switchStableSubtract = subtract(partition.assignments.switch, partition.assignments.stable)

        if !empty(switchStableSubtract): // return the removed node(s)
            partition.assignments.pending = union(partition.assignments.stable, switchStableSubtract)
            partition.assignments.switch = subtract(partition.assignments.switch, switchStableSubtract)
            partition.assignments.change.trigger.revision = event.revision
        else: // remove the restarted node(s) first
            partition.assignments.pending = subtract(partition.assignments.stable, partition.assignments.switch)
            partition.assignments.change.trigger.revision = event.revision
    else:
        if empty(partition.assignments.planned):
            partition.assignments.pending = empty
        else:
            partition.assignments.pending = partition.assignments.planned
{code}

It is also possible to update onCommonRebalance in a similar way; however, that is only an optimization, because onRebalanceDone will always treat the switch as a higher-priority action than a planned rebalance triggered by a baseline change or a replica factor change.

Besides that, in many cases, especially for in-memory partitions, it is worth removing the leaving node from the Raft group peers not on its restart but on the corresponding networkTopology.onDisappeared() event. However, that is only an optimization: e.g. in the case of a persisted table with a cleaned-up PDS, it is not known whether the PDS was cleaned up until the node checks itself on table start-up. Thus, the correctness of the algorithm is achieved by the actions taken exclusively when the node is restarted; the rest is optimization.
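For clarity, below is a minimal plain-Java sketch of the set arithmetic behind the to-be onRebalanceDone logic above: the switch works in two phases (remove the restarted node, then return it). The Assignments holder and its helper methods are hypothetical and do not correspond to real Ignite classes:

{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class SwitchRebalanceDemo {
    /** Hypothetical in-memory view of the per-partition assignment keys. */
    static class Assignments {
        Set<String> stable = new HashSet<>();
        Set<String> pending = new HashSet<>();
        Set<String> planned = new HashSet<>();
        Set<String> switchSet = new HashSet<>(); // assignments.switch

        static Set<String> union(Set<String> a, Set<String> b) {
            Set<String> r = new HashSet<>(a);
            r.addAll(b);
            return r;
        }

        static Set<String> subtract(Set<String> a, Set<String> b) {
            Set<String> r = new HashSet<>(a);
            r.removeAll(b);
            return r;
        }
    }

    /** To-be onRebalanceDone: a pending switch takes priority over a planned rebalance. */
    static void onRebalanceDone(Assignments p, Set<String> appliedPeers) {
        p.stable = new HashSet<>(appliedPeers);

        if (!p.switchSet.isEmpty()) {
            // Switch nodes that are no longer in the stable set still have to be returned.
            Set<String> toReturn = Assignments.subtract(p.switchSet, p.stable);

            if (!toReturn.isEmpty()) {
                // Phase 2: schedule the rebalance that brings the restarted node back.
                p.pending = Assignments.union(p.stable, toReturn);
                p.switchSet = Assignments.subtract(p.switchSet, toReturn);
            } else {
                // Phase 1: schedule the rebalance that removes the restarted node.
                p.pending = Assignments.subtract(p.stable, p.switchSet);
            }
        } else {
            // No switch in progress: fall back to the regular planned-rebalance handling.
            p.pending = new HashSet<>(p.planned);
        }
    }

    public static void main(String[] args) {
        Assignments p1 = new Assignments();
        p1.switchSet.add("C"); // node C restarted and lost its volatile Raft state

        // The ongoing, non-cancelable rebalance [A,B,C] -> [A,B,C,D] finishes first.
        onRebalanceDone(p1, Set.of("A", "B", "C", "D"));
        System.out.println("pending after phase 1: " + new TreeSet<>(p1.pending)); // [A, B, D]

        // The removal rebalance finishes, so returning C is scheduled next.
        onRebalanceDone(p1, Set.of("A", "B", "D"));
        System.out.println("pending after phase 2: " + new TreeSet<>(p1.pending)); // [A, B, C, D]
    }
}
{code}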
All in all, the general flow on node restart will look like the following:

{code:java}
private void onPartitionStart(Partition p) {
    if (p.table().inMemory() || !p.localStorage().isAvailable()) {
        if (p.isMajorityAvailable()) {
            <inline metaStorageInvoke*>
            // Do not start the corresponding raft server!
        } else if (fullPartitionRestart) {
            commonStart();
        } else {
            // Do nothing.
        }
    } else {
        commonStart();
    }
}
{code}
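As a reference point, here is a plain-Java sketch of what the <inline metaStorageInvoke*> step above is expected to do on an in-memory partition restart. The PartitionKeys holder, the -1 sentinel for an empty revision and the restartedNodes parameter are hypothetical simplifications; in the real system the whole method body would have to run as a single atomic, conditional metastore invoke:

{code:java}
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class InMemoryPartitionRestartDemo {
    /** Hypothetical snapshot of the per-partition metastore keys plus the guard revision. */
    static class PartitionKeys {
        Set<String> stable = new HashSet<>();
        Set<String> pending = new HashSet<>();
        Set<String> switchSet = new HashSet<>(); // assignments.switch
        long changeTriggerRevision = -1;         // assignments.change.trigger.revision (-1 = empty)
    }

    /** The switch-related update performed when an in-memory partition restarts. */
    static void onInMemoryPartitionRestart(PartitionKeys p, Set<String> restartedNodes, long eventRevision) {
        // Idempotency guard: skip events that have already been applied.
        if (p.changeTriggerRevision != -1 && p.changeTriggerRevision >= eventRevision) {
            return;
        }

        // Merge the restarted nodes into the switch set.
        Set<String> assignmentsSwitch = new HashSet<>(p.switchSet);
        assignmentsSwitch.addAll(restartedNodes);

        if (p.pending.isEmpty()) {
            // No rebalance in flight: immediately schedule the removal of the restarted nodes.
            Set<String> reduced = new HashSet<>(p.stable);
            reduced.removeAll(assignmentsSwitch);
            p.pending = reduced;
        }
        // Otherwise only record the switch; onRebalanceDone will handle it with higher priority.

        p.switchSet = assignmentsSwitch;
        p.changeTriggerRevision = eventRevision;
    }

    public static void main(String[] args) {
        PartitionKeys p1 = new PartitionKeys();
        p1.stable = new HashSet<>(Set.of("A", "B", "C"));

        onInMemoryPartitionRestart(p1, Set.of("C"), 100L);
        System.out.println("pending: " + new TreeSet<>(p1.pending));   // pending: [A, B]
        System.out.println("switch:  " + new TreeSet<>(p1.switchSet)); // switch:  [C]

        // A duplicate event with the same revision is a no-op thanks to the guard.
        onInMemoryPartitionRestart(p1, Set.of("C"), 100L);
    }
}
{code}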