Ivan Bessonov created IGNITE-20425:
--------------------------------------

             Summary: Corrupted Raft FSM state after restart
                 Key: IGNITE-20425
                 URL: https://issues.apache.org/jira/browse/IGNITE-20425
             Project: Ignite
          Issue Type: Bug
            Reporter: Ivan Bessonov


According to the protocol, there are several numeric indexes in the Log / FSM:
 * {{lastLogIndex}} - index of the last entry written to the log.
 * {{committedIndex}} - index of the last committed log entry. 
{{committedIndex <= lastLogIndex}}.
 * {{appliedIndex}} - index of the last log entry processed by the state machine.

If the committed index is less than the last log index, Raft can invoke the 
"truncate suffix" procedure and delete the uncommitted tail of the log. This is 
a valid thing to do.
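
To make the rule concrete, here is a minimal sketch of a log that allows "truncate suffix" only above the committed index. {{ToyRaftLog}} and its methods are hypothetical names used for illustration; this is not JRaft's log manager API.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory log (hypothetical, not JRaft code). Indexes are 1-based. */
public class ToyRaftLog {
    private final List<String> entries = new ArrayList<>();
    private long committedIndex = 0;

    /** Appends an entry and returns its index. */
    public long append(String payload) {
        entries.add(payload);
        return lastLogIndex();
    }

    public long lastLogIndex() {
        return entries.size();
    }

    public long committedIndex() {
        return committedIndex;
    }

    /** Marks entries up to the given index as committed. */
    public void commit(long index) {
        if (index > lastLogIndex()) {
            throw new IllegalArgumentException("cannot commit beyond lastLogIndex");
        }
        committedIndex = Math.max(committedIndex, index);
    }

    /** Drops every entry whose index is greater than lastIndexKept. */
    public void truncateSuffix(long lastIndexKept) {
        if (lastIndexKept < committedIndex) {
            throw new IllegalStateException("cannot truncate committed entries");
        }
        while (lastLogIndex() > lastIndexKept) {
            entries.remove(entries.size() - 1);
        }
    }
}
```

With 12 entries and {{commit(11)}}, {{truncateSuffix(11)}} succeeds and removes entry 12, while {{truncateSuffix(10)}} is rejected.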

Now, imagine the following scenario:
 * {{lastLogIndex == 12}}, {{committedIndex == 11}}
 * Node is restarted
 * Upon recovery, we replay the entire log. Now {{appliedIndex == 12}}
 * After recovery, we join the group and receive a "truncate suffix" command 
to delete the uncommitted entries.
 * We must delete entry 12, but it's already applied. The peer is broken.
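
The steps above can be traced with plain counters. This is only a sketch under the assumption that the node has no durable record of {{committedIndex}} at restart; {{EagerReplayDemo}} is a hypothetical name, and nothing here calls real JRaft code.

```java
/** Traces the scenario with counters only (hypothetical, not JRaft code). */
public class EagerReplayDemo {
    /** Returns {lastLogIndex, appliedIndex} after the steps of the scenario. */
    public static long[] run() {
        long lastLogIndex = 12;
        long committedIndex = 11; // known to the group, assumed not persisted locally

        // Restart: without a durable committedIndex the node replays the whole log.
        long appliedIndex = 0;
        for (long i = 1; i <= lastLogIndex; i++) {
            appliedIndex = i; // stands in for applying entry i to the state machine
        }

        // After the join, the uncommitted suffix is truncated.
        lastLogIndex = committedIndex;

        return new long[] {lastLogIndex, appliedIndex};
    }

    public static void main(String[] args) {
        long[] state = run();
        System.out.println("lastLogIndex=" + state[0] + ", appliedIndex=" + state[1]);
    }
}
```

The trace ends with {{appliedIndex}} greater than {{lastLogIndex}}: the state machine contains an entry that no longer exists in the log.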

The reason is that we don't use the default recovery procedure: 
{{org.apache.ignite.raft.jraft.core.NodeImpl#init}}

Canonical Raft doesn't replay the log before the join is complete.

A down-to-earth scenario that shows this situation in practice:
 * Start a group with 3 nodes: A, B, and C.
 * We assume that A is the leader.
 * Shut down A; leader re-election is triggered.
 * We assume that B votes for C.
 * C receives the grant from B and proceeds to write the new configuration 
into its local log.
 * Shut down B before it writes the same log entry (an easily reproducible race).
 * Shut down C.
 * Restart the cluster.

Resulting states:

A - [1: initial cfg]

B - [1: initial cfg]

C - [1: initial cfg, 2: re-election]
h3. How to fix

Option a. Recover the log after the join. This is not optimal; it's like 
performing local recovery after cluster activation in Ignite 2. We fixed that 
behavior a long time ago.

Option b. Track the committed index and perform a partial recovery that 
guarantees safety. We could write the committed index into log storage 
periodically.
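
A rough sketch of what the partial recovery could look like, assuming a periodically persisted committed index is available at restart. {{BoundedRecovery}}, {{safeReplayLimit}} and {{persistedCommittedIndex}} are hypothetical names, not existing Ignite or JRaft APIs. The persisted value may lag behind the real committed index, but it must never exceed it.

```java
/** Sketch of option b (hypothetical, not Ignite or JRaft code). */
public class BoundedRecovery {
    /**
     * Returns the highest index that is safe to replay on restart.
     * A stale persistedCommittedIndex only means less is replayed locally;
     * the remaining entries are applied after the node joins the group.
     */
    public static long safeReplayLimit(long lastLogIndex, long persistedCommittedIndex) {
        return Math.min(lastLogIndex, persistedCommittedIndex);
    }

    /** Replays entries in (appliedIndex, limit] and returns the new appliedIndex. */
    public static long replay(long appliedIndex, long lastLogIndex, long persistedCommittedIndex) {
        long limit = safeReplayLimit(lastLogIndex, persistedCommittedIndex);
        for (long i = appliedIndex + 1; i <= limit; i++) {
            appliedIndex = i; // stands in for applying entry i to the state machine
        }
        return appliedIndex;
    }
}
```

For the scenario above, {{replay(0, 12, 11)}} stops at 11, so entry 12 is never applied before the group decides whether it survives.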

"b" is better, but maybe there are other ways as well.



