[
https://issues.apache.org/jira/browse/IGNITE-21062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Lapin updated IGNITE-21062:
-------------------------------------
Reviewer: Ivan Bessonov
> Safe time reordering in partitions
> ----------------------------------
>
> Key: IGNITE-21062
> URL: https://issues.apache.org/jira/browse/IGNITE-21062
> Project: Ignite
> Issue Type: Bug
> Reporter: Ivan Bessonov
> Assignee: Alexander Lapin
> Priority: Major
> Labels: ignite-3
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> When many tables are created on a (presumably) slow system, a
> {{Safe time reordering detected [current=...}} assertion error may appear
> in the logs.
> It happens with safe-time sync commands, in the absence of transactional load.
> h3. UPD #1
> The following steps lead to the "Safe time reordering detected" problem.
> 0. PartitionReplicaListener and PartitionListener are located on the same
> Ignite node. No serialization/deserialization within raft itself.
> maxObservableSafeTime = -1, maxObservableSafeTimeVerifier = -1.
> 1. A SafeTimePropagatingCommand with safeTime = 10 successfully passes the
> maxObservableSafeTime gateway and the messaging thread is released. The raft
> thread picks up the command and hangs before stateMachine.onWrite() ->
> maxObservableSafeTime = 10, maxObservableSafeTimeVerifier = -1
> 2. The client, for some reason (e.g. TimeoutException), re-sends the same
> command, which is rolled back because command.safeTime and
> maxObservableSafeTime are both 10.
> 3. The client updates the command with a new safeTime (e.g. 20), which also
> modifies the "initial command" from step 1, since it is the same object.
> 4. The raft thread from step 1 applies the unintentionally updated command to
> the state machine -> maxObservableSafeTime = 10, maxObservableSafeTimeVerifier
> = 20 (not expected; it should match maxObservableSafeTime).
> 5. The retried command from step 3 successfully passes the gateway and hits
> the assertion because maxObservableSafeTimeVerifier == command.safeTime.
> The logs quite often showed an exact safeTime match, which supports this
> explanation.
> {code:java}
> [2023-11-29T14:27:05,893][ERROR][%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-_stripe_7-0][StripedDisruptor]
> Handle disruptor event error
> [name=%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-,
> event=org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTask@2e6cd30d,
> hasHandler=false]
> java.lang.AssertionError: Safe time reordering detected
> [current=111494301331423233, proposed=111494301331423233]
> {code}
> All in all, this means that messages should be immutable.
> * As a first step, to verify that the explanation above is correct, the
> message can be cloned and the cloned copy updated instead.
> * However, it is better to remove the ability to call set<> on the message
> itself, so some sort of factory method is much more reliable here.
> That will be covered in a separate ticket.
>
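The steps and the proposed fix can be sketched in a small, single-threaded model. This is only an illustration under stated assumptions, not Ignite code: MutableCommand, ImmutableCommand, MonotonicTracker and withSafeTime are hypothetical names, and the rule "a safe time is accepted only if it is strictly greater than the previous one" is inferred from the assertion message in the ticket. The thread interleaving from steps 1-5 is replayed sequentially.

```java
// Hypothetical model of the flow described in the ticket; these are not Ignite classes.
class SafeTimeAliasingSketch {
    /** Mutable command: a retry overwrites safeTime in place. */
    static final class MutableCommand {
        long safeTime;
        MutableCommand(long safeTime) { this.safeTime = safeTime; }
    }

    /** Immutable command: a retry gets a fresh instance from a factory-style method. */
    static final class ImmutableCommand {
        final long safeTime;
        ImmutableCommand(long safeTime) { this.safeTime = safeTime; }
        ImmutableCommand withSafeTime(long newSafeTime) { return new ImmutableCommand(newSafeTime); }
    }

    /** Assumed rule: a value is accepted only if strictly greater than the last accepted one. */
    static final class MonotonicTracker {
        long last = -1;
        boolean accept(long proposed) {
            if (proposed <= last) return false;
            last = proposed;
            return true;
        }
    }

    /** Replays steps 1-5 with a shared mutable command; true means reordering was detected. */
    static boolean reorderingWithMutableCommand() {
        MonotonicTracker gateway = new MonotonicTracker();  // stands in for maxObservableSafeTime
        MonotonicTracker verifier = new MonotonicTracker(); // stands in for maxObservableSafeTimeVerifier

        MutableCommand cmd = new MutableCommand(10);
        gateway.accept(cmd.safeTime);        // step 1: passes the gateway, gateway = 10
        MutableCommand inFlight = cmd;       // same object on the raft side (no serialization)
        gateway.accept(cmd.safeTime);        // step 2: retry is turned away, 10 <= 10
        cmd.safeTime = 20;                   // step 3: also changes the in-flight command
        verifier.accept(inFlight.safeTime);  // step 4: verifier = 20 while gateway = 10
        boolean passed = gateway.accept(cmd.safeTime);   // step 5: 20 > 10, passes
        return passed && !verifier.accept(cmd.safeTime); // 20 <= 20, check fails
    }

    /** Same replay, but the retry is a new instance, so the in-flight command keeps 10. */
    static boolean reorderingWithImmutableCommand() {
        MonotonicTracker gateway = new MonotonicTracker();
        MonotonicTracker verifier = new MonotonicTracker();

        ImmutableCommand first = new ImmutableCommand(10);
        gateway.accept(first.safeTime);                  // step 1
        ImmutableCommand inFlight = first;
        gateway.accept(first.safeTime);                  // step 2: turned away
        ImmutableCommand retry = first.withSafeTime(20); // step 3: fresh copy
        verifier.accept(inFlight.safeTime);              // step 4: verifier = 10
        boolean passed = gateway.accept(retry.safeTime); // step 5: passes
        return passed && !verifier.accept(retry.safeTime); // 20 > 10, check holds
    }

    public static void main(String[] args) {
        System.out.println("mutable: reordering=" + reorderingWithMutableCommand());
        System.out.println("immutable: reordering=" + reorderingWithImmutableCommand());
    }
}
```

In this model the mutable variant reports a reordering and the immutable one does not, which matches the reasoning in the description: once the in-flight command cannot be changed by a retry, the two counters advance in step.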