[
https://issues.apache.org/jira/browse/IGNITE-22904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886348#comment-17886348
]
Roman Puchkovskiy commented on IGNITE-22904:
--------------------------------------------
The original approach described in the IEP and in this issue (under 'Old
description (would not work)') would not work, as it suggests leaving the
existing MG Raft storages (including snapshots and log) in place and 'patching'
the situation on top of them. But if a node that saw no MG repair is actually
far ahead (this might happen if the 'old majority' did not die, but merely
became unavailable to the user due to a network partition while still operating
for some time), it could have applied many log entries, taken a snapshot and
truncated its log prefix. If, after that, we try to migrate such a node to a
repaired cluster (where the MG leader has far fewer log entries), the leader
would try to truncate the ex-leader's log suffix, but this would fail because
part of that suffix has already been applied.
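The conflict can be illustrated with a minimal sketch (my own code, not Ignite
or JRaft internals) of the Raft invariant at play: a log may not truncate a
suffix at or below its applied index, which is exactly what the new leader
would ask the migrated node to do.

```java
// Minimal sketch of the invariant that breaks the old approach: a Raft log
// cannot truncate entries that have already been applied to the state machine.
// This is illustrative code, not Ignite/JRaft internals.
final class RaftLogSketch {
    private final long appliedIndex; // highest entry applied to the state machine
    private long lastLogIndex;       // highest entry present in the log

    RaftLogSketch(long appliedIndex, long lastLogIndex) {
        this.appliedIndex = appliedIndex;
        this.lastLogIndex = lastLogIndex;
    }

    /** Called when the leader's AppendEntries conflicts with our log suffix. */
    void truncateSuffix(long firstIndexToRemove) {
        if (firstIndexToRemove <= appliedIndex) {
            // The migrated node has already applied part of the suffix the
            // new (repaired) leader wants removed, so truncation must fail.
            throw new IllegalStateException("Cannot truncate applied entries: "
                    + "firstIndexToRemove=" + firstIndexToRemove
                    + ", appliedIndex=" + appliedIndex);
        }
        lastLogIndex = firstIndexToRemove - 1;
    }

    long lastLogIndex() {
        return lastLogIndex;
    }
}
```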
That's why an alternative approach is proposed, in which a node that saw no
repair destroys its MG Raft storages (and its state machine state) before
joining the MG, letting the new leader supply it with the MG state via snapshot
installation and/or AppendEntries.
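The order of operations matters: local state is wiped and the witnessed repair
is recorded before the Raft server starts. A minimal sketch, with hypothetical
names (not Ignite 3 APIs), recording the steps in order:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.UUID;

// Hedged sketch of the proposed order of operations; the step names are
// illustrative placeholders, not real Ignite 3 components.
final class MgRejoinSketch {
    private final Deque<String> actions = new ArrayDeque<>();

    void rejoin(UUID currentClusterId) {
        // 1. Destroy the MG Raft meta, log and snapshot storages, plus the
        //    Metastorage KV state machine state.
        actions.add("destroy-mg-raft-and-kv-storages");
        // 2. Record that this node witnessed the repair in this incarnation.
        actions.add("vault.write witnessedMetastorageRepairClusterId=" + currentClusterId);
        // 3. Only then start the MG Raft server; it comes up empty and is
        //    filled by the new leader (snapshot install and/or AppendEntries).
        actions.add("start-mg-raft-server");
    }

    Deque<String> actions() {
        return actions;
    }
}
```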
> Disallow old MG majority to hijack leadership
> ---------------------------------------------
>
> Key: IGNITE-22904
> URL: https://issues.apache.org/jira/browse/IGNITE-22904
> Project: Ignite
> Issue Type: Improvement
> Reporter: Roman Puchkovskiy
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: iep-128, ignite-3
> Time Spent: 40m
> Remaining Estimate: 0h
>
> If some node did not see Metastorage repair, it will be migrated to the new
> cluster using the Migrate REST/CLI command. Such a node (judging by its local
> MG Raft log) might still think it's a member of the voting set, so it might
> propose itself as a candidate, and it can win the election if there are
> enough such nodes. This will result in the leadership being hijacked by the
> 'old' majority, which will mess the repaired Metastorage up. This has to be
> avoided.
> To do so, the following should be done:
> # In the CMG, add a property called metastorageRepairClusterId (empty in the
> blank cluster)
> # When, during MG repair (IGNITE-22899), we choose new metastorageNodes and
> save them to the CMG (which happens before resetPeers() is called), we write
> metastorageRepairClusterId together with metastorageNodes to the CMG
> # We add a property called witnessedMetastorageRepairClusterId to the Vault.
> This property will store the clusterId of the cluster incarnation in which
> the node witnessed MG repair (either it participated in the repair, or it was
> migrated and successfully performed the 'MG reentry' procedure, see below).
> This property is empty on a blank node
> # When a node handles a
> MetastorageIndexTermRequestMessage, it writes the current clusterId to its
> Vault.witnessedMetastorageRepairClusterId. As a result, every node
> participating in the MG repair will be marked as a witness of the repair, and
> we will not need to perform 'MG reentry' for them
> # On node start, before starting the MG, the Ignite node obtains
> metastorageNodes and metastorageRepairClusterId from the CMG leader. If the
> latter is not null and Vault.witnessedMetastorageRepairClusterId is absent or
> differs from it, the node has to perform the 'MG reentry'
> procedure.
> # The 'MG reentry' procedure is as follows:
> ## The node destroys all 3 Raft storages for the MG (the meta, log and
> snapshot storages) as well as the Metastorage KV storage
> ## Writes current clusterId to Vault.witnessedMetastorageRepairClusterId
> ## Then starts the MG Raft server as usual
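The decision of whether reentry is needed boils down to a simple predicate on
the two clusterId properties. A sketch under assumed names (this is not an
actual Ignite 3 API):

```java
import java.util.Objects;
import java.util.UUID;

// Illustrative sketch of the 'MG reentry' decision: reentry is needed when a
// repair happened (metastorageRepairClusterId is set in the CMG) and this node
// did not witness it in the same cluster incarnation.
final class MgReentrySketch {
    static boolean reentryNeeded(UUID repairClusterIdFromCmg, UUID witnessedIdFromVault) {
        return repairClusterIdFromCmg != null
                && !Objects.equals(repairClusterIdFromCmg, witnessedIdFromVault);
    }
}
```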
> Another potential issue is that, when a node reenters and still has just a
> partial log, the part it has might tell it that it's a voting member. If at
> this moment the leader fails, the reentering node (which is not a member of
> the voting set in the latest configuration) might believe it IS such a member
> (as it only sees a part of the log) and hijack the leadership.
> To prevent this:
> # We'll add yet another property to the CMG, that is, metastorageRepairIndex.
> # When writing metastorageNodes to the CMG during repair, we also write
> metastorageRepairIndex
> # When doing the 'MG reentry' procedure, in NodeImpl#init(), we store
> metastorageRepairIndex in a volatile field and, if the latest local config
> index is less than this index, we change the volatile config to the current
> metastorage nodes from the CMG (to prevent the reentered Raft node from
> becoming a leader); when the configuration is updated (coming from
> AppendEntries or an installed snapshot), we check whether its index has
> reached metastorageRepairIndex, and if not, we change the volatile
> configuration again.
> This also solves a potential ABA problem.
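The guard described above can be sketched as follows. Names and structure are
assumptions for illustration, not real JRaft code: while the locally applied
configuration index is still below metastorageRepairIndex, the volatile voting
set is forced to the CMG's current metastorageNodes so the reentering node
cannot win an election with a stale voting set.

```java
import java.util.List;

// Illustrative sketch (assumed names) of the metastorageRepairIndex guard.
final class RepairIndexGuardSketch {
    private final long metastorageRepairIndex;      // read from the CMG during init
    private final List<String> cmgMetastorageNodes; // current metastorageNodes from the CMG
    private List<String> volatileVotingSet;

    RepairIndexGuardSketch(long repairIndex, List<String> cmgNodes) {
        this.metastorageRepairIndex = repairIndex;
        this.cmgMetastorageNodes = cmgNodes;
        this.volatileVotingSet = cmgNodes; // start overridden until the repair config is seen
    }

    /** Called when a configuration entry is applied (from AppendEntries or a snapshot). */
    void onConfigurationApplied(long configIndex, List<String> configPeers) {
        if (configIndex < metastorageRepairIndex) {
            // Still behind the repair point: keep overriding with the CMG's view.
            volatileVotingSet = cmgMetastorageNodes;
        } else {
            // Reached the post-repair configuration: trust the log again.
            volatileVotingSet = configPeers;
        }
    }

    List<String> volatileVotingSet() {
        return volatileVotingSet;
    }
}
```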
> h2. Old description (would not work)
> If, during a join (on getting the fresh cluster state from the CMG), a node
> detects that, according to the MG configuration saved in the MG on this node,
> this node is a member of the voting set (i.e. it's a peer, not a learner),
> and this node is NOT one of the metastorageNodes in the CMG, then, before
> starting its MG Raft member, it raises a flag that disallows its Raft node
> from becoming a candidate.
> (This flag does not exist in JRaft, we need to introduce it there; the flag
> is not persisted).
> As soon as the Raft node applies a new Raft configuration (coming from the
> new leader), this flag is cleared.
> After this, the Raft node is ‘converted’ to the new MG and cannot hijack the
> leadership.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)