[
https://issues.apache.org/jira/browse/IGNITE-23661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Lapin updated IGNITE-23661:
-------------------------------------
Description:
The test may fail with
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
    at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
    at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
    at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
    at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
    at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
    at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
    at app//org.apache.ignite.internal.distributionzones.ItIgniteDistributionZoneManagerNodeRestartTest.testFirstLogicalTopologyUpdateInterruptedEventRestoredAfterRestart(ItIgniteDistributionZoneManagerNodeRestartTest.java:562)
    at [email protected]/java.lang.reflect.Method.invoke(Method.java:566)
    at [email protected]/java.util.ArrayList.forEach(ArrayList.java:1541)
    at [email protected]/java.util.ArrayList.forEach(ArrayList.java:1541)
{code}
Reproduced locally in 1 of 100 runs.
In short, the chain of causes behind the failure is as follows:
1.
{code:java}
assertTrue(waitForCondition(() ->
newLogicalTopology.equals(finalDistributionZoneManager.logicalTopology()){code}
fails because
{code:java}
finalDistributionZoneManager.logicalTopology(){code}
is not updated because
2.
notificationFuture in WatchProcessor#notifyWatches() is not completed because
notifyUpdateRevisionFuture is not completed because
3.
{code:java}
assert causalityToken > lastCompleteToken{code}
in
org.apache.ignite.internal.causality.IncrementalVersionedValue#completeInternal
fails (although the failure is not logged) because
4.
watch events are reordered by revision even though they are correctly
ordered by timestamp; see
org.apache.ignite.internal.metastorage.server.NotifyWatchProcessorEvent#compareTo
for details.
5.
All in all, the order mismatch is the result of StandaloneMetaStorageManager
not being thread-safe. Both onBeforeApply and command processing within the
listener must be thread-safe; because they are not, there is a race between
operation safeTime assignment and revision calculation. onBeforeApply is
guarded by a group-specific monitor, namely
{code:java}
synchronized (groupIdSyncMonitor(request.groupId())){code}
See ActionRequestProcessor.handleRequestInternal for more details. Command
processing, in turn, is expected to run under the Raft umbrella, that is,
in a single-threaded environment.
The corresponding synchronisation was missing in StandaloneMetaStorageManager.
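To illustrate the invariant described above, here is a minimal sketch, not Ignite code. It assumes a simplified model in which safe time and revision are plain counters; the class SafeTimeRevisionSketch and its members (Stamped, stampAndApply, ordersAgree, runDemo) are hypothetical names introduced only for this example. When both values are assigned under one monitor, sorting events by revision also yields ascending safe times.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical stand-in for the two steps that raced; not the actual Ignite classes. */
public class SafeTimeRevisionSketch {
    /** A command stamped with both orderings. */
    static final class Stamped {
        final long safeTime;
        final long revision;

        Stamped(long safeTime, long revision) {
            this.safeTime = safeTime;
            this.revision = revision;
        }
    }

    private final Object monitor = new Object();
    private long clock;    // assumption: stands in for safe time assignment
    private long revision; // assumption: stands in for revision calculation

    /** Assigns safe time and revision as one atomic step, so both orders agree. */
    Stamped stampAndApply() {
        synchronized (monitor) {
            long ts = ++clock;
            long rev = ++revision;
            return new Stamped(ts, rev);
        }
    }

    /** Returns true if sorting by revision also yields strictly ascending safe times. */
    static boolean ordersAgree(List<Stamped> events) {
        List<Stamped> byRevision = new ArrayList<>(events);
        byRevision.sort(Comparator.comparingLong((Stamped s) -> s.revision));
        for (int i = 1; i < byRevision.size(); i++) {
            if (byRevision.get(i - 1).safeTime >= byRevision.get(i).safeTime) {
                return false;
            }
        }
        return true;
    }

    /** Runs several threads through stampAndApply() and checks the resulting order. */
    static boolean runDemo(int threads, int perThread) {
        SafeTimeRevisionSketch sketch = new SafeTimeRevisionSketch();
        ConcurrentLinkedQueue<Stamped> out = new ConcurrentLinkedQueue<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    out.add(sketch.stampAndApply());
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out.size() == threads * perThread && ordersAgree(new ArrayList<>(out));
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1000));
    }
}
```

If the two increments in stampAndApply() were done outside a common monitor, one thread could take the earlier safe time but the later revision, which is the kind of mismatch ordersAgree would report.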
was:
The test may fail with
{code:java}
org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
    at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
    at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
    at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
    at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
    at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
    at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:183)
    at app//org.apache.ignite.internal.distributionzones.ItIgniteDistributionZoneManagerNodeRestartTest.testFirstLogicalTopologyUpdateInterruptedEventRestoredAfterRestart(ItIgniteDistributionZoneManagerNodeRestartTest.java:562)
    at [email protected]/java.lang.reflect.Method.invoke(Method.java:566)
    at [email protected]/java.util.ArrayList.forEach(ArrayList.java:1541)
    at [email protected]/java.util.ArrayList.forEach(ArrayList.java:1541)
{code}
Reproduced locally in 1 of 100 runs.
> ItIgniteDistributionZoneManagerNodeRestartTest.testFirstLogicalTopologyUpdateInterruptedEventRestoredAfterRestart
> is flaky
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-23661
> URL: https://issues.apache.org/jira/browse/IGNITE-23661
> Project: Ignite
> Issue Type: Bug
> Reporter: Alexander Lapin
> Assignee: Alexander Lapin
> Priority: Major
> Labels: ignite-3
> Time Spent: 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)