[
https://issues.apache.org/jira/browse/FLINK-31133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693544#comment-17693544
]
Roman Khachatryan commented on FLINK-31133:
-------------------------------------------
There are two issues in case of a checkpoint failure:
# A FAIL command might be dispatched to a source task that has already finished
execution
# Waiting for the failover times out, but the test then waits indefinitely to
obtain the job status result
The issue affects only 1.15 because in later versions the state upload timeout
and the number of attempts were increased in FLINK-27169.
I've created a [PR|https://github.com/apache/flink/pull/22022] to address (1)
and (2) and reopened FLINK-27169 to backport increased timeouts/attempts to
1.15.
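
To illustrate the second issue, here is a minimal sketch of a bounded wait for a job status result. It is not the code from the PR: the class BoundedStatusWait, the Status enum and the awaitStatus helper are hypothetical names made up for this example, and the only real API used is java.util.concurrent.CompletableFuture. The idea is simply that the wait for the status uses a timeout of its own, so a future that never completes yields an empty result instead of blocking forever.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedStatusWait {

    /** Hypothetical stand-in for a job status; this is not Flink's JobStatus. */
    enum Status { RUNNING, RESTARTING, FINISHED, FAILED }

    /**
     * Waits for the status at most {@code timeout}. Returns an empty Optional
     * on timeout instead of blocking indefinitely, so the caller can decide
     * whether to retry or fail the test.
     */
    static Optional<Status> awaitStatus(
            CompletableFuture<Status> statusFuture, Duration timeout)
            throws InterruptedException, ExecutionException {
        try {
            return Optional.of(
                    statusFuture.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // The status was not available in time; report that to the caller.
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that is never completed models a status request that hangs.
        CompletableFuture<Status> hanging = new CompletableFuture<>();
        System.out.println(
                awaitStatus(hanging, Duration.ofMillis(100)).isPresent());

        // An already completed future is returned immediately.
        CompletableFuture<Status> done =
                CompletableFuture.completedFuture(Status.FINISHED);
        System.out.println(awaitStatus(done, Duration.ofMillis(100)).get());
    }
}
```

With an unbounded get() the first call in main would never return; with the bounded variant the caller gets an empty result after roughly 100 ms and can fail with a clear message.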
> PartiallyFinishedSourcesITCase hangs if a checkpoint fails
> ----------------------------------------------------------
>
> Key: FLINK-31133
> URL: https://issues.apache.org/jira/browse/FLINK-31133
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.3, 1.16.1, 1.18.0, 1.17.1
> Reporter: Matthias Pohl
> Assignee: Roman Khachatryan
> Priority: Major
> Labels: test-stability
> Fix For: 1.15.4, 1.16.2, 1.18.0, 1.17.1
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b
> This build ran into a timeout. Based on the stacktraces reported, it was
> either caused by
> [SnapshotMigrationTestBase.restoreAndExecute|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=13475]:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007f23d800b800 nid=0x60cdd waiting on
> condition [0x00007f23e1c0d000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.flink.test.checkpointing.utils.SnapshotMigrationTestBase.restoreAndExecute(SnapshotMigrationTestBase.java:382)
> at
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSnapshot(TypeSerializerSnapshotMigrationITCase.java:172)
> at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
> [...]
> {code}
> or
> [PartiallyFinishedSourcesITCase.test|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10401]:
> {code}
> 2023-02-20T07:13:05.6084711Z "main" #1 prio=5 os_prio=0
> tid=0x00007fd35c00b800 nid=0x8c8a waiting on condition [0x00007fd363d0f000]
> 2023-02-20T07:13:05.6085149Z java.lang.Thread.State: TIMED_WAITING
> (sleeping)
> 2023-02-20T07:13:05.6085487Z at java.lang.Thread.sleep(Native Method)
> 2023-02-20T07:13:05.6085925Z at
> org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
> 2023-02-20T07:13:05.6086512Z at
> org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:138)
> 2023-02-20T07:13:05.6087103Z at
> org.apache.flink.runtime.testutils.CommonTestUtils.waitForSubtasksToFinish(CommonTestUtils.java:291)
> 2023-02-20T07:13:05.6087730Z at
> org.apache.flink.runtime.operators.lifecycle.TestJobExecutor.waitForSubtasksToFinish(TestJobExecutor.java:226)
> 2023-02-20T07:13:05.6088410Z at
> org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase.test(PartiallyFinishedSourcesITCase.java:138)
> 2023-02-20T07:13:05.6088957Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [...]
> {code}
> Still, it sounds odd: based on a code analysis, it's quite unlikely that either
> of those two caused the issue. The former has a 5 min timeout (see related code in
> [SnapshotMigrationTestBase:382|https://github.com/apache/flink/blob/release-1.15/flink-tests/src/test/java/org/apache/flink/test/checkpointing/utils/SnapshotMigrationTestBase.java#L382]).
> As for the latter, it turned out not to be responsible in the past, when some
> other concurrent test caused the issue (see FLINK-30261).
> An investigation into where the time was lost before the timeout revealed that
> {{AdaptiveSchedulerITCase}} took 2980s to finish (see [build
> logs|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=5265]).
> {code}
> 2023-02-20T03:43:55.4546050Z Feb 20 03:43:55 [ERROR] Picked up
> JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> 2023-02-20T03:43:58.0448506Z Feb 20 03:43:58 [INFO] Running
> org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
> 2023-02-20T04:33:38.6824634Z Feb 20 04:33:38 [INFO] Tests run: 6, Failures:
> 0, Errors: 0, Skipped: 0, Time elapsed: 2,980.445 s - in
> org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
> {code}