[
https://issues.apache.org/jira/browse/IGNITE-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-20425:
-----------------------------------
Description:
According to the protocol, there are several numeric indexes in the Log / FSM:
* {{lastLogIndex}} - index of the last log entry written to the log.
* {{committedIndex}} - index of the last committed log entry. {{committedIndex <= lastLogIndex}}.
* {{appliedIndex}} - index of the last log entry applied by the state machine. {{appliedIndex <= committedIndex}}.
If the committed index is less than the last log index, Raft can invoke the "truncate suffix" procedure and delete the uncommitted tail of the log. This is a valid thing to do.
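To make the hazard concrete, here is a minimal illustrative sketch of the three indexes and the safety precondition that "truncate suffix" implicitly relies on (the class and method names are assumptions for illustration, not actual jraft code):
{code:java}
// Illustrative sketch only; these names are assumptions, not the jraft API.
final class RaftIndexes {
    long lastLogIndex;   // last entry written to the log
    long committedIndex; // last committed entry; committedIndex <= lastLogIndex
    long appliedIndex;   // last entry applied to the FSM; appliedIndex <= committedIndex

    /** Drops all log entries with index > lastIndexKept. */
    void truncateSuffix(long lastIndexKept) {
        // Committed entries must never be dropped.
        assert lastIndexKept >= committedIndex : "would drop committed entries";
        // Truncation is only safe if nothing past the kept prefix
        // has already been applied to the state machine.
        assert lastIndexKept >= appliedIndex : "entries up to " + appliedIndex
            + " are already applied; truncating would break the peer";
        lastLogIndex = lastIndexKept;
    }
}
{code}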
Now, imagine the following scenario:
* {{lastLogIndex == 12}}, {{committedIndex == 11}}
* The node is restarted.
* Upon recovery, we replay the entire log. Now {{appliedIndex == 12}}.
* After recovery, we join the group and receive a "truncate suffix" command that is meant to delete the uncommitted entries.
* We must delete entry 12, but it has already been applied. The peer is broken (see the sketch below).
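A minimal sketch of the problematic replay loop, assuming hypothetical {{LogStorage}} and {{StateMachine}} interfaces (illustrations only, not the actual Ignite or jraft API):
{code:java}
// Hypothetical minimal interfaces, for illustration only.
interface LogStorage {
    long lastLogIndex();
    byte[] entryAt(long index);
}

interface StateMachine {
    long appliedIndex();
    void apply(byte[] entry);
}

final class NaiveRecovery {
    /** Replays the whole local log into the state machine on startup. */
    static void recover(LogStorage log, StateMachine fsm) {
        // BUG: replays up to lastLogIndex (12 in the example above) instead
        // of stopping at the committed index (11), so appliedIndex reaches 12
        // before the node rejoins the group and learns the real commit point.
        for (long i = fsm.appliedIndex() + 1; i <= log.lastLogIndex(); i++) {
            fsm.apply(log.entryAt(i));
        }
    }
}
{code}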
The reason is that we don't use the default recovery procedure: {{org.apache.ignite.raft.jraft.core.NodeImpl#init}}. Canonical Raft doesn't replay the log before the join is complete.
A down-to-earth scenario that shows this situation in practice:
* Start a group with 3 nodes: A, B, and C.
* We assume that A is the leader.
* Shut down A; leader re-election is triggered.
* We assume that B votes for C.
* C receives the vote grant from B and proceeds to write the new configuration entry into its local log.
* Shut down B before it writes the same log entry (an easily reproducible race).
* Shut down C.
* Restart the cluster.
Resulting log states:
{noformat}
A - [1: initial cfg]
B - [1: initial cfg]
C - [1: initial cfg, 2: re-election]
{noformat}
h3. How to fix
Option a. Recover the log after the join completes. This is not optimal; it's like performing local recovery after cluster activation in Ignite 2, a behavior we fixed a long time ago.
Option b. Track the committed index somehow and perform a partial recovery that guarantees safety. For example, we could write the committed index into the log storage periodically (sketched below).
Option "b" is better, but maybe there are other ways as well.
> Corrupted Raft FSM state after restart
> --------------------------------------
>
> Key: IGNITE-20425
> URL: https://issues.apache.org/jira/browse/IGNITE-20425
> Project: Ignite
> Issue Type: Bug
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>