[
https://issues.apache.org/jira/browse/IGNITE-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Lapin updated IGNITE-20425:
-------------------------------------
Ignite Flags: (was: Docs Required,Release Notes Required)
> Corrupted Raft FSM state after restart
> --------------------------------------
>
> Key: IGNITE-20425
> URL: https://issues.apache.org/jira/browse/IGNITE-20425
> Project: Ignite
> Issue Type: Bug
> Reporter: Ivan Bessonov
> Assignee: Alexander Lapin
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> According to the protocol, there are several numeric indexes in the Log / FSM:
> * {{lastLogIndex}} - index of the last logged log entry.
> * {{committedIndex}} - index of the last committed log entry. {{committedIndex <= lastLogIndex}}.
> * {{appliedIndex}} - index of the last log entry processed by the state machine. {{appliedIndex <= committedIndex}}.
> If the committed index is less than the last index, Raft can invoke the "truncate suffix" procedure and delete the uncommitted tail of the log. This is a valid thing to do.
> Now, imagine the following scenario:
> * {{lastIndex == 12}}, {{committedIndex == 11}}
> * Node is restarted
> * Upon recovery, we replay the entire log. Now {{appliedIndex == 12}}
> * After recovery, we join the group and receive a "truncate suffix" command in order to delete uncommitted entries.
> * We must delete entry 12, but it's already applied. The peer is broken.
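> With the hypothetical model above, the restart scenario plays out like this (again, illustration only):
> {code:java}
> LogFsmState state = new LogFsmState();
> state.lastLogIndex = 12;
> state.committedIndex = 11;
> 
> // Restart: recovery replays the whole local log into the state machine.
> state.appliedIndex = state.lastLogIndex; // appliedIndex == 12
> 
> // After joining the group, the leader asks us to drop the uncommitted tail.
> state.truncateSuffix(11); // throws: entry 12 is already applied, the peer is broken
> {code}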
> The reason is that we don't use the default recovery procedure:
> {{org.apache.ignite.raft.jraft.core.NodeImpl#init}}
> Canonical Raft doesn't replay the log before the join is complete.
> A down-to-earth scenario that shows this situation in practice:
> * Start group with 3 nodes: A, B, and C.
> * We assume that A is a leader.
> * Shut down A; leader re-election is triggered.
> * We assume that B votes for C.
> * C receives the vote grant from B and proceeds to write the new configuration into its local log.
> * Shut down B before it writes the same log entry (an easily reproducible race).
> * Shut down C.
> * Restart cluster.
> Resulting states:
> A - [1: initial cfg]
> B - [1: initial cfg]
> C - [1: initial cfg, 2: re-election]
> h3. How to fix
> Option a. Recover the log after join. This is not optimal; it's like performing local recovery after cluster activation in Ignite 2, a behavior we fixed a long time ago.
> Option b. Somehow track the committed index and perform partial recovery, which guarantees safety. For example, we could write the committed index into the log storage periodically (see the sketch below).
> "b" is better, but maybe there are other ways as well.
> h3. Upd #1
> Most likely we can simply remove the code that awaits log replay on Raft node start, because it's no longer needed. It was originally introduced to enable primary replica direct storage reads, which is now covered properly by
> {code:java}
> /**
>  * Tries to read index from group leader and wait for this index to appear in local storage.
>  * Can possible return failed future with timeout exception, and in this case, replica would not
>  * answer to placement driver, because the response is useless. Placement driver should handle this.
>  *
>  * @param expirationTime Lease expiration time.
>  * @return Future that is completed when local storage catches up the index that is actual for
>  *         leader on the moment of request.
>  */
> private CompletableFuture<Void> waitForActualState(long expirationTime) {
>     LOG.info("Waiting for actual storage state, group=" + groupId());
> 
>     long timeout = expirationTime - currentTimeMillis();
> 
>     if (timeout <= 0) {
>         return failedFuture(new TimeoutException());
>     }
> 
>     return retryOperationUntilSuccess(raftClient::readIndex, e -> currentTimeMillis() > expirationTime, executor)
>             .orTimeout(timeout, TimeUnit.MILLISECONDS)
>             .thenCompose(storageIndexTracker::waitFor);
> }
> {code}
> Something similar applies to RO access: we await the safeTime that has a happens-before relation with the corresponding storage updates.
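> For reference, the RO path could look roughly like the sketch below: readers wait until safeTime passes their read timestamp, and safeTime is advanced on the same path that applies the storage updates, which is what gives the happens-before edge. The tracker and its methods are assumptions for illustration, not the actual Ignite 3 API:
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ConcurrentNavigableMap;
> import java.util.concurrent.ConcurrentSkipListMap;
> 
> /** Hypothetical safeTime tracker for RO reads, for illustration only (completed waiters are not cleaned up here). */
> class SafeTimeTracker {
>     private volatile long safeTime; // last safeTime propagated together with storage updates
>     private final ConcurrentNavigableMap<Long, CompletableFuture<Void>> waiters = new ConcurrentSkipListMap<>();
> 
>     /** Called on the storage update path; this is what establishes the happens-before edge for readers. */
>     void advance(long newSafeTime) {
>         safeTime = newSafeTime;
>         waiters.headMap(newSafeTime, true).values().forEach(f -> f.complete(null));
>     }
> 
>     /** An RO read at {@code readTimestamp} proceeds only after safeTime passes it. */
>     CompletableFuture<Void> waitFor(long readTimestamp) {
>         CompletableFuture<Void> fut = waiters.computeIfAbsent(readTimestamp, ts -> new CompletableFuture<>());
> 
>         // Recheck after registering the waiter: safeTime may have advanced concurrently.
>         if (readTimestamp <= safeTime) {
>             fut.complete(null);
>         }
> 
>         return fut;
>     }
> }
> {code}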
--
This message was sent by Atlassian Jira
(v8.20.10#820010)