[ https://issues.apache.org/jira/browse/FLINK-28392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563009#comment-17563009 ]
Zhu Zhu edited comment on FLINK-28392 at 7/6/22 7:25 AM:
---------------------------------------------------------

Thanks for reporting this issue and investigating it. [~martijnvisser] [~chesnay]

This issue happens because the future executor was shut down when the test finished, while the scheduler was still working and trying to deploy tasks. Two exceptions were caused by scheduling tasks via the already shut down executor:
- CompletableFuture.supplyAsync(...) in Execution#deploy()
- delayExecutor.schedule(...) in DefaultScheduler#restartTasksWithDelay(...)

Together they make the error a fatal one, because an error was thrown in Execution#markFailed(), which led to failing an Execution twice. The ExecutionDeployer encounters this problem because it retrieves the execution from ExecutionGraph#currentExecutions, but the execution was unregistered during the first round of failing the Execution. Previously this problem did not happen because the Execution was failed via ExecutionVertex, so it was not affected by the execution being unregistered.

Theoretically, we should fix all the tests that shut down the executor before terminating the job/scheduler. But I'm afraid there are many such tests and we could easily miss some of them. Therefore, I'm thinking of changing the ExecutionDeployer to not retrieve executions via ExecutionGraph#currentExecutions, to tolerate this case as before and also to avoid similar problems that may happen in production.

> RemoveCachedShuffleDescriptorTest#testRemoveOffloadedCacheForPointwiseEdgeAfterFailover
> causes fatal error on CI
> ----------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-28392
> URL: https://issues.apache.org/jira/browse/FLINK-28392
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.0
> Reporter: Martijn Visser
> Assignee: Chesnay Schepler
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.16.0
>
>
> {code:java}
> Jul 05 03:30:03 [ERROR] Error occurred in starting fork, check output in log
> Jul 05 03:30:03 [ERROR] Process Exit Code: 239
> Jul 05 03:30:03 [ERROR] Crashed tests:
> Jul 05 03:30:03 [ERROR]
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategyTest
> Jul 05 03:30:03 [ERROR]
> org.apache.maven.surefire.booter.SurefireBooterForkException:
> ExecutionException The forked VM terminated without properly saying goodbye.
> VM crash or System.exit called?
> Jul 05 03:30:03 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime &&
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m
> -jar
> /__w/1/s/flink-runtime/target/surefire/surefirebooter4932865857415988980.jar
> /__w/1/s/flink-runtime/target/surefire 2022-07-05T03-23-25_404-jvmRun1
> surefire8916732512419442726tmp surefire_2130262314165063415tmp
> Jul 05 03:30:03 [ERROR] Error occurred in starting fork, check output in log
> Jul 05 03:30:03 [ERROR] Process Exit Code: 239
> Jul 05 03:30:03 [ERROR] Crashed tests:
> Jul 05 03:30:03 [ERROR]
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategyTest
> Jul 05 03:30:03 [ERROR] at
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532)
> Jul 05 03:30:03 [ERROR] at
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:405)
> Jul 05 03:30:03 [ERROR] at
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:321)
> Jul 05 03:30:03 [ERROR] at
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266)
> Jul 05 03:30:03 [ERROR] at
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314)
> Jul 05 03:30:03 [ERROR] at
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37602&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=8147

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
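
For reference, the two failing calls named in the comment (CompletableFuture.supplyAsync(...) and delayExecutor.schedule(...)) can be reproduced outside Flink. The following is only a minimal sketch using plain JDK executors; the class and method names are hypothetical, and it assumes the test executors behave like standard java.util.concurrent thread pools with the default rejection policy. It does not use any Flink classes and does not show the proposed ExecutionDeployer change.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Flink.
public class ShutdownExecutorDemo {

    /** Returns true if supplyAsync on an already shut down executor is rejected. */
    public static boolean supplyAsyncIsRejected() {
        // Stand-in for the "future executor" mentioned in the comment (assumption).
        ExecutorService futureExecutor = Executors.newSingleThreadExecutor();
        futureExecutor.shutdown();
        try {
            // The task is handed to the executor synchronously, so the
            // rejection surfaces in the calling thread.
            CompletableFuture.supplyAsync(() -> "deployed", futureExecutor);
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    /** Returns true if schedule on an already shut down executor is rejected. */
    public static boolean scheduleIsRejected() {
        // Stand-in for the "delayExecutor" mentioned in the comment (assumption).
        ScheduledExecutorService delayExecutor = Executors.newSingleThreadScheduledExecutor();
        delayExecutor.shutdown();
        try {
            delayExecutor.schedule(() -> {}, 10, TimeUnit.MILLISECONDS);
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("supplyAsync rejected: " + supplyAsyncIsRejected());
        System.out.println("schedule rejected: " + scheduleIsRejected());
    }
}
```

Both calls throw in the caller's thread rather than completing a future exceptionally, which is consistent with the comment's description of an error being thrown while the scheduler is already handling a failure.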