[
https://issues.apache.org/jira/browse/IGNITE-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-20425:
-----------------------------------
Fix Version/s: 3.0.0-beta2
> Corrupted Raft FSM state after restart
> --------------------------------------
>
> Key: IGNITE-20425
> URL: https://issues.apache.org/jira/browse/IGNITE-20425
> Project: Ignite
> Issue Type: Bug
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
>
> According to the protocol, there are several numeric indexes in the Log / FSM:
> * {{lastLogIndex}} - index of the last logged log entry.
> * {{committedIndex}} - index of last committed log entry. {{{}committedIndex
> <= lastLogIndex{}}}.
> * {{appliedIndex}} - index of last log entry, processed by the state machine.
> If committed index is less then last index, RAFT can invoke the "truncate
> suffix" procedure and delete uncommitted log's tail. This is a valid thing to
> do.
> Now, imagine the following scenario:
> * {{{}lastIndex == 12{}}}, {{committedIndex == 11}}
> * Node is restarted
> * Upon recovery, we replay the entire log. Now {{appliedIndex == 12}}
> * After recovery, we join the group and receive "truncate suffix command" in
> order to deleted uncommitted entries.
> * We must delete entry 12, but it's already applied. Peer is broken.
> The reason is that we don't use default recovery procedure:
> {{org.apache.ignite.raft.jraft.core.NodeImpl#init}}
> Canonical raft doesn't replay log before join is complete.
> Down to earth scenario, that shows this situation in practice:
> * Start group with 3 nodes: A, B, and C.
> * We assume that A is a leader.
> * Shutdown A, leader re-election is triggered.
> * We assume that B votes for C.
> * C receives grant from B and proceeds writing new configuration into local
> log.
> * Shutdown B before it writes the same log entry (easily-reproducible race).
> * Shutdown C.
> * Restart cluster.
> Resulting states:
> A - [1: initial cfg]
> B - [1: initial cfg]
> C - [1: initial cfg, 2: re-election]
> h3. How to fix
> option a. Recover log after join. This is not optimal, it's like performing
> local recovery after cluster activation in Ignite 2. We fixed that behavior
> long time ago.
> option b. Somehow track committed index and perform partial recovery, that
> guarantees safety. We could write committed index into log storage
> periodically.
> "b" is better, but maybe there are other ways as well.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)