[
https://issues.apache.org/jira/browse/IGNITE-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Lapin updated IGNITE-16668:
-------------------------------------
Summary: Design in-memory raft group reconfiguration on node failure (was:
Raft group reconfiguration on node failure)
> Design in-memory raft group reconfiguration on node failure
> -----------------------------------------------------------
>
> Key: IGNITE-16668
> URL: https://issues.apache.org/jira/browse/IGNITE-16668
> Project: Ignite
> Issue Type: Improvement
> Reporter: Sergey Chugunov
> Assignee: Alexander Lapin
> Priority: Major
> Labels: ignite-3
> Attachments: Screenshot from 2022-04-19 11-11-05.png, Screenshot from
> 2022-04-19 11-12-55-1.png, Screenshot from 2022-04-19 11-12-55.png
>
>
> If a node storing a partition of an in-memory table fails and leaves the
> cluster all data it had is lost. From the point of view of the partition it
> looks like as the node is left forever.
> Although Raft protocol tolerates leaving some amount of nodes composing Raft
> group (partition); for in-memory caches we cannot restore replica factor
> because of in-memory nature of the table.
> It means that we need to detect failures of each node owning a partition and
> recalculate assignments for the table without keeping replica factor.
> h4. Upd 1:
> h4. Problem
> By design raft has several persisted segments, e.g. raft meta
> (term/committedIndex) and stable raft log. So, by converting common raft to
> in-memory one it’s possible to break some of it’s invariants. For example
> Node C could vote for Candidate A before self-restart and vote then for
> Candidate B after one. As a result two leaders will be elected which is
> illegal.
>
> !Screenshot from 2022-04-19 11-11-05.png!
>
> h4. Solution
> In order to solve the problem mentioned above it’s possible to remove and
> then return back the restarting node from the peers of the corresponding raft
> group. The peer-removal process should be finished before the restarting of
> the corresponding raft server node.
>
> !Screenshot from 2022-04-19 11-12-55.png!
>
> The process of removing and then returning back the restarting node is
> however itself tricky. And to answer why it’s non-trivial action, it’s
> necessary to reveal the main ideas of the rebalance protocol.
> Reconfiguration of the raft group - is a process driven by the fact of
> changing the assignments. Each partition has three corresponding sets of
> assignments stored in the metastore:
> # assignments.stable - current distribution
> # assignments.pending - partition distribution for an ongoing rebalance if
> any
> # assignments.planned - in some cases it’s not possible to cancel or merge
> pending rebalance with new one. In that case newly calculated assignments
> will be stored explicitly with corresponding assignments.planned key. It's
> worth noting that it doesn't make sense to keep more than one planned
> rebalance. Any new scheduled one will overwrite already existing.
> However such idea of overwriting the assignments.planned key wont work within
> the context of an in-memory raft restart, because it’s not valid to overwrite
> the reduction of assignments. Let's illustrate this problem with the
> following example.
> # In-memory partition p1 is hosted on nodes A, B and C, meaning that
> p1.assignments.stable=[A,B,C]
> # Let's say that the baseline was changed, resulting in a rebalance on
> assignments.pending=[A,B,C,\\{*}D\\{*}]
> # During the non-cancelable phase of [A,B,C]->[A,B,C,D], node C fails and
> returns back, meaning that we should plan [A,B,D] and [A,B,C,D] assignments.
> Both must be recorded in the only assignments.planned key meaning that
> [A,B,C,D] will overwrite reduction [A,B,D], so no actual raft reconfiguration
> will take place, which is not acceptable.
> In order to overcome given issue, let’s introduce new key
> _assignments.switch_ that will hold nodes that should be removed and then
> returned back and run following actions:
> h5. On in-memory partition restart (or on partition start with cleaned-up
> PDS):
> h6. as-is:
> N/A
> h6. to-be:
> {code:java}
> metastoreInvoke*: // atomic metastore call through multi-invoke api
> if empty(partition.assignments.change.trigger.revision) ||
> partition.assignments.change.trigger.revision < event.revision:
> var assignmentsSwitch = union(assignments.switch<oldVal>,
> assignments.swith<newVal>)
>
> if empty(partition.assignments.pending)
> partition.assignments.pending =
> substract(partition.assignments.stable, assignmentsSwitch)
> partition.assignments.switch = assignmentsSwitch
> partition.assignments.change.trigger.revision = event.revision
> else:
> partition.assignments.switch = assignmentsSwitch
> partition.assignments.change.trigger.revision = event.revision
> else:
> skip
> {code}
> h5. On rebalance done
> h6. as-is:
> {code:java}
> metastoreInvoke: \\ atomic
> partition.assignments.stable = appliedPeers
> if empty(partition.assignments.planned):
> partition.assignments.pending = empty
> else:
> partition.assignments.pending = partition.assignments.planned {code}
> h6. to-be:
> {code:java}
> metastoreInvoke: \\ atomic
> partition.assignments.stable = appliedPeers
>
> if !empty(parition.assignments.switch)
> var stableSwitchSubtract = subtract(parition.assignments.switch,
> partition.assignments.stable)
> if !empty(stableSwitchSubtract) // return returned node
> partition.assignments.pending =
> union(partition.assignments.stable, pendingSwitchSubtract)
> partition.assignments.switch =
> subtract(partition.assignments.switch, pendingSwitchSubtract)
> partition.assignments.change.trigger.revision = event.revision
> else // subtract returned node
> partition.assignments.pending =
> subtract(partition.assignments.stable, parition.assignments.switch)
> partition.assignments.change.trigger.revision = event.revision
>
> else
> if empty(partition.assignments.planned):
> partition.assignments.pending = empty
> else:
> partition.assignments.pending = partition.assignments.planned
> {code}
> It’s also possible to update onCommonRebalance in a similar way, however
> it’s only an optimization because onRebalanceDone will always consider switch
> as a higher priority action in comparison to planned rebalance triggered by
> baseline change or replica factor change.
> Besides that, in many cases, especially for in-memory partitions, it is worth
> removing leaving node from raft group peers not within its restart but during
> corresponding networkTopology.onDisappeared() event. However It’s only an
> optimization. E.g. in the case of persisted table with cleaned up PDS it’s
> not known whether it was cleaned up until node self-check on table start up.
> Thus, the correctness of the algorithm is achieved due to the actions taken
> exclusively when the node is restarted - the rest is optimization.
> All in all general flow on node restart will look like following:
> {code:java}
> private void onParitionStart(Partition p) {
> if (p.table().inMememory() || !p.localStorage().isAvailable()) {
> if (p.isMajorityAvailable()) {
> <inline metaStorageInvoke*>
> // Do not start corresponding raft server!
> } else if (fullParitionRestart) {
> commonStart();
> } else {
> // Do nothing.
> }
> } else {
> commonStart();
> }
> } {code}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)