[
https://issues.apache.org/jira/browse/IGNITE-20441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksandr Polovtcev updated IGNITE-20441:
-----------------------------------------
Description:
{{org.apache.ignite.internal.rebalance.ItRebalanceRecoveryTest}} is flaky on
TC, see [TC
history|https://ci.ignite.apache.org/test/5525445229026025351?currentProjectId=ApacheIgnite3xGradle_Test_IntegrationTests&expandTestHistoryChartSection=true&branch=%3Cdefault%3E].
After some invsetigation, I found out that the problem is the following:
# Imagine two threads: A and B.
# Thread A executes an update from the test. Primary replica generates a
timestamp, creates a Raft command and tries to apply it, but the thread stalls
for any reason.
# Thread B performs idle safe time sync. Primary replica generates a timestamp
(larger than the timestamp from the previous step), creates a Raft command and
successfully applies it.
# Thread A resumes its execution and applies its command. This means that the
Raft command from thread B will be applied before the command from thread A,
despite their timestamps being ordered differently.
# According to the test protocol, a node gets restarted and needs to apply the
missing Raft log part, which means re-applying the two commands above. However,
the second command (which inserts data) will be ignored, because there's code
in {{PartitionListener}} that ignores storage updates if their timestamp is
smaller than the current timestamp (which got updated by the first command).
Therefore, this bug is definitely caused by IGNITE-20116
was:
{{org.apache.ignite.internal.rebalance.ItRebalanceRecoveryTest}} is flaky on
TC, see [TC
history|https://ci.ignite.apache.org/test/5525445229026025351?currentProjectId=ApacheIgnite3xGradle_Test_IntegrationTests&expandTestHistoryChartSection=true&branch=%3Cdefault%3E].
After some invsetigation, I found out that the problem is the following:
# Imagine two threads: A and B.
# Thread A executes an update from the test. Primary replica generates a
timestamp, creates a Raft command and tries to apply it, but the thread stalls
for any reason.
# Thread B performs idle safe time sync. Primary replica generates a timestamp
(larger than the timestamp from the previous step), creates a Raft command and
successfully applies it.
# Thread A resumes its execution and applies its command. This means that the
Raft command from thread B will be applied before the command from thread A,
despite their timestamps being ordered differently.
# According to the test protocol, a node gets restarted and needs to apply the
missing Raft log part, which means re-applying the two commands above. However,
the second command (which inserts data) will be ignored, because there's code
in {{PartitionListener}} that ignores storage updates if their timestamp is
smaller than the current timestamp (which got updated by the first command).
Therefore, this bug is definitely caused by
https://issues.apache.org/jira/browse/IGNITE-20116
> ItRebalanceRecoveryTest is flaky
> --------------------------------
>
> Key: IGNITE-20441
> URL: https://issues.apache.org/jira/browse/IGNITE-20441
> Project: Ignite
> Issue Type: Task
> Reporter: Aleksandr Polovtcev
> Assignee: Aleksandr Polovtcev
> Priority: Blocker
> Labels: ignite-3
>
> {{org.apache.ignite.internal.rebalance.ItRebalanceRecoveryTest}} is flaky on
> TC, see [TC
> history|https://ci.ignite.apache.org/test/5525445229026025351?currentProjectId=ApacheIgnite3xGradle_Test_IntegrationTests&expandTestHistoryChartSection=true&branch=%3Cdefault%3E].
> After some invsetigation, I found out that the problem is the following:
> # Imagine two threads: A and B.
> # Thread A executes an update from the test. Primary replica generates a
> timestamp, creates a Raft command and tries to apply it, but the thread
> stalls for any reason.
> # Thread B performs idle safe time sync. Primary replica generates a
> timestamp (larger than the timestamp from the previous step), creates a Raft
> command and successfully applies it.
> # Thread A resumes its execution and applies its command. This means that the
> Raft command from thread B will be applied before the command from thread A,
> despite their timestamps being ordered differently.
> # According to the test protocol, a node gets restarted and needs to apply
> the missing Raft log part, which means re-applying the two commands above.
> However, the second command (which inserts data) will be ignored, because
> there's code in {{PartitionListener}} that ignores storage updates if their
> timestamp is smaller than the current timestamp (which got updated by the
> first command).
>
> Therefore, this bug is definitely caused by IGNITE-20116
--
This message was sent by Atlassian Jira
(v8.20.10#820010)