Ivan Bessonov created IGNITE-20425:
--------------------------------------
Summary: Corrupted Raft FSM state after restart
Key: IGNITE-20425
URL: https://issues.apache.org/jira/browse/IGNITE-20425
Project: Ignite
Issue Type: Bug
Reporter: Ivan Bessonov
According to the protocol, there are several numeric indexes in the Log / FSM:
* {{lastLogIndex}} - index of the last logged log entry.
* {{committedIndex}} - index of the last committed log entry.
{{committedIndex <= lastLogIndex}}.
* {{appliedIndex}} - index of the last log entry processed by the state machine.
If the committed index is less than the last log index, Raft can invoke the
"truncate suffix" procedure and delete the uncommitted tail of the log. This is
a valid thing to do.
Now, imagine the following scenario:
* {{lastLogIndex == 12}}, {{committedIndex == 11}}
* Node is restarted
* Upon recovery, we replay the entire log. Now {{appliedIndex == 12}}
* After recovery, we join the group and receive a "truncate suffix" command in
order to delete uncommitted entries.
* We must delete entry 12, but it's already applied. Peer is broken.
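The failure above can be reproduced with a toy model. The following is only a minimal sketch and not JRaft code: {{ToyNode}}, {{restartWithFullReplay}} and {{truncateSuffix}} are hypothetical names, and the log is reduced to a list of strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a Raft peer. All names are hypothetical and are not JRaft API. */
final class ToyNode {
    /** Entry with index i is stored at position i - 1. */
    final List<String> log = new ArrayList<>();

    /** Value agreed on by the group; the restarted node itself does not consult it. */
    long committedIndex;

    long appliedIndex;

    long lastLogIndex() {
        return log.size();
    }

    /** Models the problematic recovery: every logged entry is applied, committed or not. */
    void restartWithFullReplay() {
        appliedIndex = lastLogIndex();
    }

    /** Deletes all entries after lastIndexKept; fails if an applied entry would be removed. */
    void truncateSuffix(long lastIndexKept) {
        if (lastIndexKept < appliedIndex) {
            throw new IllegalStateException(
                "Cannot truncate to " + lastIndexKept
                    + ": entries up to " + appliedIndex + " are already applied");
        }
        while (lastLogIndex() > lastIndexKept) {
            log.remove(log.size() - 1);
        }
    }

    /** Builds the state from the scenario: 12 logged entries, 11 of them committed. */
    static ToyNode scenario() {
        ToyNode node = new ToyNode();
        for (int i = 1; i <= 12; i++) {
            node.log.add("entry-" + i);
        }
        node.committedIndex = 11;
        node.appliedIndex = 11;
        return node;
    }
}
```

Calling {{restartWithFullReplay()}} on {{ToyNode.scenario()}} and then {{truncateSuffix(11)}} throws, which corresponds to the broken peer described above.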
The reason is that we don't use the default recovery procedure:
{{org.apache.ignite.raft.jraft.core.NodeImpl#init}}
Canonical Raft doesn't replay the log before the join is complete.
A down-to-earth scenario that shows this situation in practice:
* Start a group with 3 nodes: A, B, and C.
* We assume that A is the leader.
* Shut down A, leader re-election is triggered.
* We assume that B votes for C.
* C receives the grant from B and proceeds to write the new configuration into
its local log.
* Shut down B before it writes the same log entry (easily-reproducible race).
* Shut down C.
* Restart the cluster.
Resulting states:
A - [1: initial cfg]
B - [1: initial cfg]
C - [1: initial cfg, 2: re-election]
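To see why entry 2 on C is uncommitted, a rough check is to take the highest index that is stored on a majority of nodes. The sketch below is a simplification that ignores terms and leader bookkeeping; {{majorityIndex}} is a hypothetical helper, not part of JRaft.

```java
import java.util.Arrays;

final class QuorumSketch {
    /** Returns the highest log index that a majority of nodes have stored (terms are ignored). */
    static long majorityIndex(long... lastLogIndexes) {
        long[] sorted = lastLogIndexes.clone();
        Arrays.sort(sorted);
        int majority = sorted.length / 2 + 1;
        // In ascending order, the element at (length - majority) is reached by at least `majority` nodes.
        return sorted[sorted.length - majority];
    }
}
```

With A = 1, B = 1 and C = 2 the result is 1, so a full local replay on C applies an entry (index 2) that the group is still allowed to truncate.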
h3. How to fix
option a. Recover the log after join. This is not optimal: it's like performing
local recovery after cluster activation in Ignite 2. We fixed that behavior
a long time ago.
option b. Somehow track the committed index and perform a partial recovery that
guarantees safety. We could write the committed index into log storage periodically.
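Option b could look roughly like the sketch below. It assumes the committed index is persisted periodically, so the stored value may lag behind the real one but never exceeds it; replay stops at that value, and the remaining entries wait until the node has rejoined the group. {{BoundedRecovery}} and {{recover}} are hypothetical names and do not reflect any existing JRaft or Ignite API.

```java
import java.util.List;
import java.util.function.Consumer;

final class BoundedRecovery {
    /**
     * Replays entries (appliedIndex, min(persistedCommittedIndex, lastLogIndex)] into the state machine.
     * Entry with index i is stored at position i - 1 of the list.
     *
     * @return the new applied index.
     */
    static long recover(
        List<String> log,
        long appliedIndex,
        long persistedCommittedIndex,
        Consumer<String> stateMachine
    ) {
        // Never replay past what is known to be committed, nor past the end of the log.
        long upperBound = Math.min(persistedCommittedIndex, log.size());
        long index = appliedIndex;
        while (index < upperBound) {
            index++;
            stateMachine.accept(log.get((int) (index - 1)));
        }
        return index;
    }
}
```

In the scenario with {{lastLogIndex == 12}} and a persisted committed index of 11, this replay leaves entry 12 unapplied, so a later "truncate suffix" of that entry stays safe.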
"b" is better, but maybe there are other ways as well.