[jira] [Updated] (IGNITE-21062) Safe time reordering in partitions

Alexander Lapin (Jira) Fri, 15 Dec 2023 04:53:38 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-21062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexander Lapin updated IGNITE-21062:
-------------------------------------
    Description: 
In the scenario of creating a lot of table and having slow system (presumably), 
it's possible to notice {{Safe time reordering detected [current=...}} 
assertion error in logs.

It happens with safe-time sync commands, in the absence of transactional load.
h3. UPD #1

Following steps will bring us to the "Safe time reordering detected" problem.

0. PartitionReplicaListener and PartitionListener are located on the same 
Ignite node. No serialization/deserialization within raft itself. 
maxObservableSafeTime = -1, maxObservableSafeTimeVerifier = -1.

1. SafeTimePropagatingCommand  with safeTime = 10 successfully passes 
maxObservableSafeTime gateway, messaging thread is free, raft thread handles 
the command and hangs before stateMachine.onWrite() -> maxObservableSafeTime = 
10, maxObservableSafeTimeVerifier = -1

2. Client for some reason (e.g. TimeoutException) re-send same command that 
will be rolled back because both command.safeTime and maxObservableSafeTime == 
10.

3. Client updates the command with new safeTime (e.g. 20) which also will 
modify "initial command" from the step 1.

4. Raft thread from the step 1 apply unintentionally updated command to a state 
machine -> maxObservableSafeTime = 10, maxObservableSafeTimeVerifier = 20 (not 
expected, should be the same as in maxObservableSafeTime).

5. Retry command from step 3 successfully passes the gateway and face an 
assertion on maxObservableSafeTimeVerifier == command.safeTime

Rather often there was exact safeTime matching in logs, that proves given 
explanation.
{code:java}
[2023-11-29T14:27:05,893][ERROR][%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-_stripe_7-0][StripedDisruptor]
 Handle disruptor event error 
[name=%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-, 
event=org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTask@2e6cd30d, 
hasHandler=false]
java.lang.AssertionError: Safe time reordering detected 
[current=111494301331423233, proposed=111494301331423233]
All in all, that means, that messages should be immutable.{code}
 * Initially, to verify that aforementioned explanation is correct, it's 
possible to clone the message and update the cloned copy.
 * However, it's better to exclude the ability to call set<> on message itself. 
Meaning that some sort of factory method is much more reliable here. That will 
be covered in a separate ticket.

 

  was:
In the scenario of creating a lot of table and having slow system (presumably), 
it's possible to notice {{Safe time reordering detected [current=...}} 
assertion error in logs.

It happens with safe-time sync commands, in the absence of transactional load.
h3. UPD #1

Following steps will bring us to the "Safe time reordering detected" problem.

0. PartitionReplicaListener and PartitionListener are located on the same 
Ignite node. No serialization/deserialization within raft itself. 
maxObservableSafeTime = -1, maxObservableSafeTimeVerifier = -1.

1. SafeTimePropagatingCommand  with safeTime = 10 successfully passes 
maxObservableSafeTime gateway, messaging thread is free, raft thread handles 
the command and hangs before stateMachine.onWrite() -> maxObservableSafeTime = 
10, maxObservableSafeTimeVerifier = -1

2. Client for some reason (e.g. TimeoutException) re-send same command that 
will be rolled back because both command.safeTime and maxObservableSafeTime == 
10.

3. Client updates the command with new safeTime (e.g. 20) which also will 
modify "initial command" from the step 1.

4. Raft thread from the step 1 apply unintentionally updated command to a state 
machine -> maxObservableSafeTime = 10, maxObservableSafeTimeVerifier = 20 (not 
expected, should be the same as in maxObservableSafeTime).

5. Retry command from step 3 successfully passes the gateway and face an 
assertion on maxObservableSafeTimeVerifier == command.safeTime

Rather often there was exact safeTime matching in logs, that proves given 
explanation.
[2023-11-29T14:27:05,893][ERROR][%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-_stripe_7-0][StripedDisruptor]
 Handle disruptor event error 
[name=%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-, 
event=org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTask@2e6cd30d, 
hasHandler=false]
  java.lang.AssertionError: Safe time reordering detected 
[current=111494301331423233, proposed=111494301331423233]
All in all, that means, that messages should be immutable.
 * Initially, to verify that aforementioned explanation is correct, it's 
possible to clone the message and update the cloned copy.
 * However, it's better to exclude the ability to call set<> on message itself. 
Meaning that some sort of factory method is much more reliable here. That will 
be covered in a separate ticket.

 


> Safe time reordering in partitions
> ----------------------------------
>
>                 Key: IGNITE-21062
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21062
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Assignee: Alexander Lapin
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In the scenario of creating a lot of table and having slow system 
> (presumably), it's possible to notice {{Safe time reordering detected 
> [current=...}} assertion error in logs.
> It happens with safe-time sync commands, in the absence of transactional load.
> h3. UPD #1
> Following steps will bring us to the "Safe time reordering detected" problem.
> 0. PartitionReplicaListener and PartitionListener are located on the same 
> Ignite node. No serialization/deserialization within raft itself. 
> maxObservableSafeTime = -1, maxObservableSafeTimeVerifier = -1.
> 1. SafeTimePropagatingCommand  with safeTime = 10 successfully passes 
> maxObservableSafeTime gateway, messaging thread is free, raft thread handles 
> the command and hangs before stateMachine.onWrite() -> maxObservableSafeTime 
> = 10, maxObservableSafeTimeVerifier = -1
> 2. Client for some reason (e.g. TimeoutException) re-send same command that 
> will be rolled back because both command.safeTime and maxObservableSafeTime 
> == 10.
> 3. Client updates the command with new safeTime (e.g. 20) which also will 
> modify "initial command" from the step 1.
> 4. Raft thread from the step 1 apply unintentionally updated command to a 
> state machine -> maxObservableSafeTime = 10, maxObservableSafeTimeVerifier = 
> 20 (not expected, should be the same as in maxObservableSafeTime).
> 5. Retry command from step 3 successfully passes the gateway and face an 
> assertion on maxObservableSafeTimeVerifier == command.safeTime
> Rather often there was exact safeTime matching in logs, that proves given 
> explanation.
> {code:java}
> [2023-11-29T14:27:05,893][ERROR][%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-_stripe_7-0][StripedDisruptor]
>  Handle disruptor event error 
> [name=%irdt_tdpsoen_20002%JRaft-FSMCaller-Disruptor-, 
> event=org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTask@2e6cd30d, 
> hasHandler=false]
> java.lang.AssertionError: Safe time reordering detected 
> [current=111494301331423233, proposed=111494301331423233]
> All in all, that means, that messages should be immutable.{code}
>  * Initially, to verify that aforementioned explanation is correct, it's 
> possible to clone the message and update the cloned copy.
>  * However, it's better to exclude the ability to call set<> on message 
> itself. Meaning that some sort of factory method is much more reliable here. 
> That will be covered in a separate ticket.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (IGNITE-21062) Safe time reordering in partitions

Reply via email to