[ https://issues.apache.org/jira/browse/FLINK-31133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Khachatryan updated FLINK-31133:
--------------------------------------
    Priority: Major  (was: Critical)

> AdaptiveSchedulerITCase took extraordinary long to finish
> ---------------------------------------------------------
>
>                 Key: FLINK-31133
>                 URL: https://issues.apache.org/jira/browse/FLINK-31133
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.3
>            Reporter: Matthias Pohl
>            Assignee: Roman Khachatryan
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b
> This build ran into a timeout. Based on the stacktraces reported, it was either caused by
> [SnapshotMigrationTestBase.restoreAndExecute|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=13475]:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007f23d800b800 nid=0x60cdd waiting on condition [0x00007f23e1c0d000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>     at java.lang.Thread.sleep(Native Method)
>     at org.apache.flink.test.checkpointing.utils.SnapshotMigrationTestBase.restoreAndExecute(SnapshotMigrationTestBase.java:382)
>     at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSnapshot(TypeSerializerSnapshotMigrationITCase.java:172)
>     at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
> [...]
> {code}
> or
> [PartiallyFinishedSourcesITCase.test|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10401]:
> {code}
> 2023-02-20T07:13:05.6084711Z "main" #1 prio=5 os_prio=0 tid=0x00007fd35c00b800 nid=0x8c8a waiting on condition [0x00007fd363d0f000]
> 2023-02-20T07:13:05.6085149Z    java.lang.Thread.State: TIMED_WAITING (sleeping)
> 2023-02-20T07:13:05.6085487Z     at java.lang.Thread.sleep(Native Method)
> 2023-02-20T07:13:05.6085925Z     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
> 2023-02-20T07:13:05.6086512Z     at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:138)
> 2023-02-20T07:13:05.6087103Z     at org.apache.flink.runtime.testutils.CommonTestUtils.waitForSubtasksToFinish(CommonTestUtils.java:291)
> 2023-02-20T07:13:05.6087730Z     at org.apache.flink.runtime.operators.lifecycle.TestJobExecutor.waitForSubtasksToFinish(TestJobExecutor.java:226)
> 2023-02-20T07:13:05.6088410Z     at org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase.test(PartiallyFinishedSourcesITCase.java:138)
> 2023-02-20T07:13:05.6088957Z     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [...]
> {code}
> Still, this is odd: based on a code analysis, it is quite unlikely that either of those two caused the issue. The former has a 5 min timeout (see related code in
> [SnapshotMigrationTestBase:382|https://github.com/apache/flink/blob/release-1.15/flink-tests/src/test/java/org/apache/flink/test/checkpointing/utils/SnapshotMigrationTestBase.java#L382]).
> For the latter, we found in the past that it was not responsible when some other concurrent test caused the issue (see FLINK-30261).
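For readers unfamiliar with the pattern behind both stack traces (a test thread parked in Thread.sleep inside a wait loop), here is a minimal sketch of a deadline-bounded polling loop. It is an illustration only: `DeadlinePolling` and `waitUntil` are hypothetical names, and this is not the code of `SnapshotMigrationTestBase` or `CommonTestUtils`. The point is that a loop capped this way cannot, on its own, block for much longer than its timeout, which is why a 5 min cap makes the first candidate an unlikely cause.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of Flink. */
final class DeadlinePolling {

    private DeadlinePolling() {}

    /**
     * Polls {@code condition} until it returns true, sleeping between checks.
     * Throws TimeoutException once {@code timeout} has elapsed.
     */
    static void waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException, TimeoutException {
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            long remainingNanos = deadlineNanos - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException("Condition not met within " + timeout);
            }
            // Sleep for one poll interval, but never past the deadline.
            long sleepNanos = Math.min(pollInterval.toNanos(), remainingNanos);
            Thread.sleep(sleepNanos / 1_000_000, (int) (sleepNanos % 1_000_000));
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        // This condition becomes true after roughly 50 ms.
        waitUntil(
                () -> System.nanoTime() - start > Duration.ofMillis(50).toNanos(),
                Duration.ofSeconds(5),
                Duration.ofMillis(10));
        System.out.println("condition met");

        try {
            // This condition never becomes true, so the deadline fires.
            waitUntil(() -> false, Duration.ofMillis(100), Duration.ofMillis(10));
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

A thread dump taken while such a loop is running shows the thread as TIMED_WAITING (sleeping), so seeing that state in a dump only says the test was still waiting at that moment, not that it consumed the build's time budget.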
> An investigation into where the time was lost before the timeout revealed that {{AdaptiveSchedulerITCase}} took 2980s (about 50 minutes) to finish (see
> [build logs|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=5265]).
> {code}
> 2023-02-20T03:43:55.4546050Z Feb 20 03:43:55 [ERROR] Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> 2023-02-20T03:43:58.0448506Z Feb 20 03:43:58 [INFO] Running org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
> 2023-02-20T04:33:38.6824634Z Feb 20 04:33:38 [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2,980.445 s - in org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)