[
https://issues.apache.org/jira/browse/FLINK-29789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sopan Phaltankar updated FLINK-29789:
-------------------------------------
Description:
The test
org.apache.flink.runtime.checkpoint.CheckpointCoordinatorTest.testTriggerAndDeclineCheckpointComplex
is flaky and has the following failure:
Failures:
[ERROR] Failures:
[ERROR] CheckpointCoordinatorTest.testTriggerAndDeclineCheckpointComplex:1054
expected:<2> but was:<1>
I used the tool [NonDex|https://github.com/TestingResearchIllinois/NonDex] to
find this flaky test.
Command: mvn -pl flink-runtime edu.illinois:nondex-maven-plugin:1.1.2:nondex
-Dtest=org.apache.flink.runtime.checkpoint.CheckpointCoordinatorTest#testTriggerAndDeclineCheckpointComplex
I analyzed the assertion failure and found that checkpoint1Id and checkpoint2Id
are assigned by iterating over a HashMap.
HashMap makes no guarantee about its iteration order
[JavaDoc|https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html#entrySet--],
so the two IDs can be assigned in either order and the test fails for some orders.
Therefore, to remove this non-determinism, we would change HashMap to
LinkedHashMap, which iterates in insertion order.
On further analysis, I found that the Map is initialized on line
1894 of the org.apache.flink.runtime.checkpoint.CheckpointCoordinator class.
After changing from HashMap to LinkedHashMap, the above test passes without
any non-determinism.
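
To illustrate why the swap helps, here is a minimal standalone sketch. It is not the Flink code: the class name, the helper methods and the map contents are made up for illustration. It only shows that LinkedHashMap iterates in insertion order, while HashMap gives no such guarantee.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class, not part of Flink.
class IterationOrderSketch {

    // Hypothetical helper: returns the keys in the order the map iterates them.
    static List<Long> keysInIterationOrder(Map<Long, String> pending) {
        return new ArrayList<>(pending.keySet());
    }

    // Inserts two made-up checkpoint IDs, the larger one first.
    static Map<Long, String> fill(Map<Long, String> map) {
        map.put(1_000_003L, "inserted first");
        map.put(17L, "inserted second");
        return map;
    }

    public static void main(String[] args) {
        // LinkedHashMap: iteration follows insertion order.
        Map<Long, String> linked = fill(new LinkedHashMap<>());
        System.out.println("LinkedHashMap: " + keysInIterationOrder(linked));

        // HashMap: the order depends on hashing and is not guaranteed,
        // so code that reads "the first" and "the second" key is fragile.
        Map<Long, String> hashed = fill(new HashMap<>());
        System.out.println("HashMap:       " + keysInIterationOrder(hashed));
    }
}
```

A test that picks IDs by position while iterating such a map can only be stable if the map itself has a defined iteration order, which is what the change to LinkedHashMap provides.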
was:
The test
_org.apache.flink.streaming.api.operators.co.CoBroadcastWithNonKeyedOperatorTest.testMultiStateSupport_
has the following failure:
Failures:
[ERROR] CoBroadcastWithNonKeyedOperatorTest.testMultiStateSupport:74
Wrong Side Output: arrays first differed at element [0]; expected:<Record @ 15
: 9:key.6->6> but was:<Record @ 15 : 9:key.5->5>
I used the tool [NonDex|https://github.com/TestingResearchIllinois/NonDex] to
find this flaky test.
Command: mvn edu.illinois:nondex-maven-plugin:1.1.2:nondex -Dtest='Fully
Qualified Test Name'
I analyzed the assertion failure and found that the root cause is that the
test method calls ctx.getBroadcastState(STATE_DESCRIPTOR).immutableEntries(),
which calls the entrySet() method of the underlying HashMap. entrySet() returns
the entries in no guaranteed order, causing the test to be flaky.
The fix would be to change _HashMap_ to _LinkedHashMap_ where the Map is
initialized.
On further analysis, I found that the Map is initialized on line
53 of the org.apache.flink.runtime.state.HeapBroadcastState class.
After changing from HashMap to LinkedHashMap, the above test passes.
Edit: After making this change and running the CI, I found that the tests
org.apache.flink.api.datastream.DataStreamBatchExecutionITCase.batchKeyedBroadcastExecution
and
org.apache.flink.api.datastream.DataStreamBatchExecutionITCase.batchBroadcastExecution
were failing. Upon further investigation, I found that these tests were also
flaky and depended on the change made earlier.
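
As a side note, a test can also avoid depending on map iteration order on its own side. The sketch below is not taken from the Flink tests and is not the fix proposed in this ticket: the class, the emitRecords stand-in and the record strings are assumptions made for illustration. It compares expected and actual records after sorting copies of both lists, so the result does not change with the order in which the entries were emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class, not part of Flink.
class OrderInsensitiveCheckSketch {

    // Hypothetical stand-in for code under test: emits one record per map entry,
    // in whatever order entrySet() happens to iterate.
    static List<String> emitRecords(Map<String, Integer> state) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : state.entrySet()) {
            out.add(entry.getKey() + "->" + entry.getValue());
        }
        return out;
    }

    // Compares two lists as multisets by sorting copies of both.
    static boolean sameElementsIgnoringOrder(List<String> expected, List<String> actual) {
        List<String> a = new ArrayList<>(expected);
        List<String> b = new ArrayList<>(actual);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();
        state.put("key.5", 5);
        state.put("key.6", 6);

        List<String> expected = Arrays.asList("key.6->6", "key.5->5");
        // Holds regardless of the order in which the HashMap yields its entries.
        System.out.println(sameElementsIgnoringOrder(expected, emitRecords(state)));
    }
}
```

This only applies where the order of the output is not itself part of what the test is meant to verify.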
> Fix flaky tests in CheckpointCoordinatorTest
> --------------------------------------------
>
> Key: FLINK-29789
> URL: https://issues.apache.org/jira/browse/FLINK-29789
> Project: Flink
> Issue Type: Bug
> Reporter: Sopan Phaltankar
> Priority: Minor
> Labels: pull-request-available