[jira] [Updated] (IGNITE-20441) ItRebalanceRecoveryTest is flaky

Aleksandr Polovtcev (Jira) Wed, 20 Sep 2023 06:28:50 -0700


     [ https://issues.apache.org/jira/browse/IGNITE-20441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Aleksandr Polovtcev updated IGNITE-20441:
-----------------------------------------
    Description: 
{{org.apache.ignite.internal.rebalance.ItRebalanceRecoveryTest}} is flaky on 
TC, see [TC 
history|https://ci.ignite.apache.org/test/5525445229026025351?currentProjectId=ApacheIgnite3xGradle_Test_IntegrationTests&expandTestHistoryChartSection=true&branch=%3Cdefault%3E].

After some investigation, I found out that the problem is the following:

# Imagine two threads: A and B.
# Thread A executes an update from the test. Primary replica generates a 
timestamp, creates a Raft command and tries to apply it, but the thread stalls 
for any reason.
# Thread B performs idle safe time sync. Primary replica generates a timestamp 
(larger than the timestamp from the previous step), creates a Raft command and 
successfully applies it.
# Thread A resumes its execution and applies its command. This means that the 
Raft command from thread B will be applied before the command from thread A, 
despite their timestamps being ordered differently.
# According to the test protocol, a node gets restarted and needs to apply the 
missing Raft log part, which means re-applying the two commands above. However, 
the second command (which inserts data) will be ignored, because there's code 
in {{PartitionListener}} that ignores storage updates if their timestamp is 
smaller than the current timestamp (which got updated by the first command).
 
Therefore, this bug is definitely caused by IGNITE-20116
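
The replay in the last step can be modelled with a minimal Java sketch. All names in it (ReorderedReplaySketch, Command, replay, safeTime) are hypothetical and are not taken from the Ignite code base. The only behaviour borrowed from the description above is the check that skips a storage update whose timestamp is smaller than the current timestamp. Treat it as an illustration of the ordering problem under those assumptions, not as the actual {{PartitionListener}} logic.

```java
import java.util.ArrayList;
import java.util.List;

public class ReorderedReplaySketch {
    /** Hypothetical stand-in for a replicated command; not an Ignite class. */
    static final class Command {
        final String name;
        final long timestamp;
        final boolean updatesStorage;

        Command(String name, long timestamp, boolean updatesStorage) {
            this.name = name;
            this.timestamp = timestamp;
            this.updatesStorage = updatesStorage;
        }
    }

    /** Replays the log in order and returns the names of the storage updates that were applied. */
    static List<String> replay(List<Command> log) {
        long safeTime = 0; // assumed "current timestamp" tracked by the listener
        List<String> applied = new ArrayList<>();
        for (Command cmd : log) {
            if (cmd.updatesStorage) {
                if (cmd.timestamp < safeTime) {
                    continue; // update is older than the current timestamp, so it is ignored
                }
                applied.add(cmd.name);
            }
            safeTime = Math.max(safeTime, cmd.timestamp);
        }
        return applied;
    }

    /** Log order from the scenario: thread B's sync (timestamp 2) lands before thread A's update (timestamp 1). */
    static List<Command> reorderedLog() {
        List<Command> log = new ArrayList<>();
        log.add(new Command("safeTimeSync", 2, false));
        log.add(new Command("update", 1, true));
        return log;
    }

    /** Log order that matches the timestamp order, for comparison. */
    static List<Command> orderedLog() {
        List<Command> log = new ArrayList<>();
        log.add(new Command("update", 1, true));
        log.add(new Command("safeTimeSync", 2, false));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(replay(reorderedLog())); // prints [] because the update is dropped
        System.out.println(replay(orderedLog()));   // prints [update]
    }
}
```

In this model the reordered log loses the update on replay, while the log whose order matches the timestamp order keeps it.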


  was:
{{org.apache.ignite.internal.rebalance.ItRebalanceRecoveryTest}} is flaky on 
TC, see [TC 
history|https://ci.ignite.apache.org/test/5525445229026025351?currentProjectId=ApacheIgnite3xGradle_Test_IntegrationTests&expandTestHistoryChartSection=true&branch=%3Cdefault%3E].

After some investigation, I found out that the problem is the following:

# Imagine two threads: A and B.
# Thread A executes an update from the test. Primary replica generates a 
timestamp, creates a Raft command and tries to apply it, but the thread stalls 
for any reason.
# Thread B performs idle safe time sync. Primary replica generates a timestamp 
(larger than the timestamp from the previous step), creates a Raft command and 
successfully applies it.
# Thread A resumes its execution and applies its command. This means that the 
Raft command from thread B will be applied before the command from thread A, 
despite their timestamps being ordered differently.
# According to the test protocol, a node gets restarted and needs to apply the 
missing Raft log part, which means re-applying the two commands above. However, 
the second command (which inserts data) will be ignored, because there's code 
in {{PartitionListener}} that ignores storage updates if their timestamp is 
smaller than the current timestamp (which got updated by the first command).
 
Therefore, this bug is definitely caused by 
https://issues.apache.org/jira/browse/IGNITE-20116



> ItRebalanceRecoveryTest is flaky
> --------------------------------
>
>                 Key: IGNITE-20441
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20441
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Aleksandr Polovtcev
>            Assignee: Aleksandr Polovtcev
>            Priority: Blocker
>              Labels: ignite-3
>
> {{org.apache.ignite.internal.rebalance.ItRebalanceRecoveryTest}} is flaky on 
> TC, see [TC 
> history|https://ci.ignite.apache.org/test/5525445229026025351?currentProjectId=ApacheIgnite3xGradle_Test_IntegrationTests&expandTestHistoryChartSection=true&branch=%3Cdefault%3E].
> After some investigation, I found out that the problem is the following:
> # Imagine two threads: A and B.
> # Thread A executes an update from the test. Primary replica generates a 
> timestamp, creates a Raft command and tries to apply it, but the thread 
> stalls for any reason.
> # Thread B performs idle safe time sync. Primary replica generates a 
> timestamp (larger than the timestamp from the previous step), creates a Raft 
> command and successfully applies it.
> # Thread A resumes its execution and applies its command. This means that the 
> Raft command from thread B will be applied before the command from thread A, 
> despite their timestamps being ordered differently.
> # According to the test protocol, a node gets restarted and needs to apply 
> the missing Raft log part, which means re-applying the two commands above. 
> However, the second command (which inserts data) will be ignored, because 
> there's code in {{PartitionListener}} that ignores storage updates if their 
> timestamp is smaller than the current timestamp (which got updated by the 
> first command).
>  
> Therefore, this bug is definitely caused by IGNITE-20116



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
