[ 
https://issues.apache.org/jira/browse/IGNITE-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-20425:
-----------------------------------
    Fix Version/s: 3.0.0-beta2

> Corrupted Raft FSM state after restart
> --------------------------------------
>
>                 Key: IGNITE-20425
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20425
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> According to the protocol, there are several numeric indexes in the Log / FSM:
>  * {{lastLogIndex}} - index of the last entry written to the log.
>  * {{committedIndex}} - index of the last committed log entry. 
> {{committedIndex <= lastLogIndex}}.
>  * {{appliedIndex}} - index of the last log entry processed by the state 
> machine.
> If the committed index is less than the last index, Raft can invoke the 
> "truncate suffix" procedure and delete the uncommitted tail of the log. This 
> is a valid thing to do.
> Now, imagine the following scenario:
>  * {{lastLogIndex == 12}}, {{committedIndex == 11}}
>  * The node is restarted.
>  * Upon recovery, we replay the entire log. Now {{appliedIndex == 12}}.
>  * After recovery, we join the group and receive a "truncate suffix" command 
> that deletes the uncommitted entries.
>  * We must delete entry 12, but it's already applied. The peer is broken.
> The reason is that we don't use the default recovery procedure: 
> {{org.apache.ignite.raft.jraft.core.NodeImpl#init}}
> Canonical Raft doesn't replay the log before the join is complete.
> A down-to-earth scenario that shows this situation in practice:
>  * Start a group with 3 nodes: A, B, and C.
>  * We assume that A is the leader.
>  * Shut down A; leader re-election is triggered.
>  * We assume that B votes for C.
>  * C receives the vote grant from B and proceeds to write the new 
> configuration into its local log.
>  * Shut down B before it writes the same log entry (an easily reproducible 
> race).
>  * Shut down C.
>  * Restart the cluster.
> Resulting states:
> A - [1: initial cfg]
> B - [1: initial cfg]
> C - [1: initial cfg, 2: re-election]
> h3. How to fix
> option a. Recover the log after the join. This is not optimal; it's like 
> performing local recovery after cluster activation in Ignite 2. We fixed that 
> behavior a long time ago.
> option b. Somehow track the committed index and perform a partial recovery 
> that guarantees safety. We could write the committed index into the log 
> storage periodically.
> "b" is better, but maybe there are other ways as well.
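
The failure scenario and option b from the description can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Node` class, its fields (including `persisted_committed_index`), and its methods are hypothetical names invented for this illustration, not the JRaft or Ignite API, and the "state machine" is just a list of applied entries.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of one Raft peer (hypothetical, not the JRaft API)."""
    log: list                               # log[i] holds the entry with index i + 1
    persisted_committed_index: int = 0      # assumed to be saved periodically (option b)
    applied_index: int = 0
    state: list = field(default_factory=list)  # stand-in for the state machine

    @property
    def last_log_index(self):
        return len(self.log)

    def replay(self, up_to):
        """Apply entries with indexes in (applied_index, up_to]."""
        for index in range(self.applied_index + 1, up_to + 1):
            self.state.append(self.log[index - 1])
            self.applied_index = index

    def recover_full(self):
        # Behavior described in the issue: replay the entire local log.
        self.replay(self.last_log_index)

    def recover_partial(self):
        # Option b: replay only entries known to be committed.
        self.replay(min(self.persisted_committed_index, self.last_log_index))

    def truncate_suffix(self, last_index_kept):
        """Drop entries after last_index_kept; refuse if one was already applied."""
        if last_index_kept < self.applied_index:
            raise RuntimeError(
                f"cannot truncate to {last_index_kept}: "
                f"entry {self.applied_index} is already applied"
            )
        del self.log[last_index_kept:]


def make_node():
    # lastLogIndex == 12, committedIndex == 11, as in the scenario above.
    return Node(log=[f"entry-{i}" for i in range(1, 13)],
                persisted_committed_index=11)
```

With `recover_full()`, the node applies entry 12 and a later `truncate_suffix(11)` raises. With `recover_partial()`, it stops at entry 11 and the same truncation succeeds. Because the committed index would only be persisted periodically, the saved value may lag behind the real one. That is still safe in this model, since the remaining entries can be applied after the join.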



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
