[ 
https://issues.apache.org/jira/browse/FLINK-30301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17645029#comment-17645029
 ] 

Roman Khachatryan commented on FLINK-30301:
-------------------------------------------

Sorry for introducing the instability and thanks for reporting it.

 

Locally, I found that the existing heartbeat timeout of 1s can be too small 
when the machine is overloaded.

When this timeout fires, the TM releases its connection with the JM, 
cancelling all the tasks.

Subsequent "cancelTask" calls then fail with the reported TaskException 
("Cannot find task to stop for execution ...").

 

This can be reproduced either by increasing the number of tasks from 10 to e.g. 100 
[here|https://github.com/apache/flink/blob/d86ae5d642fa578fb85118e81dd4140d504f818a/flink-runtime/src/test/java/org/apache/flink/runtime/taskexecutor/TaskExecutorTest.java#L3042],
or by adding `Thread.sleep(1000)` (i.e. 1s) right before `cancelTask` 
[here|https://github.com/apache/flink/blob/d86ae5d642fa578fb85118e81dd4140d504f818a/flink-runtime/src/test/java/org/apache/flink/runtime/taskexecutor/TaskExecutorTest.java#L3077].
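
The failure mode can be sketched in isolation. The snippet below is only a simplified model, not Flink code: `MiniTaskExecutor`, `advanceTo` and `cancelSucceeds` are hypothetical names, time is simulated instead of using real threads, and the exception type is a plain `IllegalStateException` rather than Flink's `TaskException`. It assumes that a heartbeat timeout drops all tasks, and that cancelling an unknown task throws.

```java
import java.util.HashMap;
import java.util.Map;

public class HeartbeatRaceSketch {

    /** Hypothetical stand-in for a task executor; NOT Flink's TaskExecutor. */
    static final class MiniTaskExecutor {
        private final Map<String, String> tasks = new HashMap<>();
        private final long heartbeatTimeoutMs;
        private final long lastHeartbeatMs = 0L; // no heartbeat arrives in this model

        MiniTaskExecutor(long heartbeatTimeoutMs) {
            this.heartbeatTimeoutMs = heartbeatTimeoutMs;
        }

        void submitTask(String executionId) {
            tasks.put(executionId, "RUNNING");
        }

        /** Advances simulated time; on heartbeat timeout all tasks are dropped. */
        void advanceTo(long nowMs) {
            if (nowMs - lastHeartbeatMs >= heartbeatTimeoutMs) {
                // Models "TM releases the connection with JM, cancelling all the tasks".
                tasks.clear();
            }
        }

        /** Fails if the task is already gone, like the reported exception. */
        void cancelTask(String executionId) {
            if (tasks.remove(executionId) == null) {
                throw new IllegalStateException(
                        "Cannot find task to stop for execution " + executionId + ".");
            }
        }
    }

    /** Returns true if cancelTask still finds the task after the given delay. */
    public static boolean cancelSucceeds(long heartbeatTimeoutMs, long delayBeforeCancelMs) {
        MiniTaskExecutor executor = new MiniTaskExecutor(heartbeatTimeoutMs);
        executor.submitTask("task-0");
        // Models an overloaded machine (or a Thread.sleep before cancelTask).
        executor.advanceTo(delayBeforeCancelMs);
        try {
            executor.cancelTask("task-0");
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 1s timeout, cancel delayed by 1.5s: the task is already gone.
        System.out.println(cancelSucceeds(1_000L, 1_500L));   // prints "false"
        // Large timeout, same delay: the cancel call still finds the task.
        System.out.println(cancelSucceeds(60_000L, 1_500L));  // prints "true"
    }
}
```

In this model, the only difference between the failing and the passing case is the size of the timeout relative to the delay before `cancelTask`.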

 

I think simply increasing the heartbeat timeouts to large values should be enough. 
Otherwise, a new `HeartbeatServices` would have to be added.

I've created a [PR|https://github.com/apache/flink/pull/21467] for that. Would 
you be able to take a look, [~mapohl]?

> TaskExecutorTest.testSharedResourcesLifecycle failed with TaskException
> -----------------------------------------------------------------------
>
>                 Key: FLINK-30301
>                 URL: https://issues.apache.org/jira/browse/FLINK-30301
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.0
>            Reporter: Matthias Pohl
>            Assignee: Roman Khachatryan
>            Priority: Major
>              Labels: test-stability
>
> This seems to be a follow-up of FLINK-30275. Same test but different test 
> failure (2x in the same build):
> * https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43709&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=125e07e7-8de0-5c6c-a541-a567415af3ef&l=7479
> * https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43709&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=7852
> {code}
> Dec 05 03:59:18       at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
> Dec 05 03:59:18       at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
> Dec 05 03:59:18       at org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
> Dec 05 03:59:18       at org.junit.platform.launcher.core.SessionPerRequestLauncher.execute(SessionPerRequestLauncher.java:53)
> Dec 05 03:59:18       at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.lambda$execute$1(JUnitPlatformProvider.java:199)
> Dec 05 03:59:18       at java.util.Iterator.forEachRemaining(Iterator.java:116)
> Dec 05 03:59:18       at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.execute(JUnitPlatformProvider.java:193)
> Dec 05 03:59:18       at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:154)
> Dec 05 03:59:18       at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:120)
> Dec 05 03:59:18       at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
> Dec 05 03:59:18       at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
> Dec 05 03:59:18       at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
> Dec 05 03:59:18       at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
> Dec 05 03:59:18 Caused by: org.apache.flink.runtime.taskexecutor.exceptions.TaskException: Cannot find task to stop for execution 096b33c46c225fd4af41a9484b64c7fe_010f83ce510d70707aaf04c441173b70_0_0.
> Dec 05 03:59:18       at org.apache.flink.runtime.taskexecutor.TaskExecutor.cancelTask(TaskExecutor.java:864)
> Dec 05 03:59:18       ... 53 more
> {code}



