[jira] [Comment Edited] (FLINK-28392) RemoveCachedShuffleDescriptorTest#testRemoveOffloadedCacheForPointwiseEdgeAfterFailover causes fatal error on CI

Zhu Zhu (Jira) Wed, 06 Jul 2022 00:26:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-28392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563009#comment-17563009
 ]


Zhu Zhu edited comment on FLINK-28392 at 7/6/22 7:25 AM:
---------------------------------------------------------

Thanks for reporting and investigating this issue. [~martijnvisser] 
[~chesnay]
This issue happens because the future executor was shut down when the test 
finished, while the scheduler was still running and trying to deploy tasks. 
Scheduling tasks via the shut-down executor caused 2 exceptions:
 - CompletableFuture.supplyAsync(...) in Execution#deploy()
 - delayExecutor.schedule(...) in DefaultScheduler#restartTasksWithDelay(...)
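
To illustrate the two rejections listed above, here is a minimal standalone 
sketch. It is not Flink code: the class and method names are made up, and it 
only assumes the executor is a plain JDK ScheduledExecutorService with the 
default rejection policy. Both calls throw a RejectedExecutionException 
synchronously once the executor has been shut down.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Flink.
public class ShutdownExecutorDemo {

    // Returns true if handing async work to the executor is rejected.
    static boolean supplyAsyncRejected(ScheduledExecutorService executor) {
        try {
            CompletableFuture.supplyAsync(() -> "deployed", executor);
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    // Returns true if scheduling a delayed task on the executor is rejected.
    static boolean scheduleRejected(ScheduledExecutorService executor) {
        try {
            executor.schedule(() -> {}, 10, TimeUnit.MILLISECONDS);
            return false;
        } catch (RejectedExecutionException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();
        // Mimics a test shutting down its executor while work is still submitted.
        executor.shutdownNow();

        System.out.println("supplyAsync rejected: " + supplyAsyncRejected(executor));
        System.out.println("schedule rejected: " + scheduleRejected(executor));
    }
}
```

Any caller that does not expect these exceptions at the submission site will 
see them surface in whatever code path triggered the submission.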

Together they make the error fatal, because an error was thrown in 
Execution#markFailed(), which led to the Execution being failed twice. The 
ExecutionDeployer encounters this problem because it retrieves the execution 
from ExecutionGraph#currentExecutions, but the execution was already 
unregistered during the first round of failing the Execution. Previously this 
problem did not happen because the Execution was failed via its 
ExecutionVertex, so it was not affected by the unregistering.

Theoretically, we should fix all the tests which shut down the executor before 
terminating the job/scheduler. But I'm afraid there are many such tests and we 
could easily miss some of them. Therefore, I'm thinking of changing the 
ExecutionDeployer to not retrieve executions via 
ExecutionGraph#currentExecutions, so that it tolerates this case as before and 
also avoids similar problems that may happen in production.
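
The difference between the two lookup styles can be sketched with a toy model. 
This is only an illustration under my own assumptions: ToyGraph, ToyExecution 
and both fail* methods are hypothetical and do not mirror the real 
ExecutionGraph or ExecutionDeployer code. It just shows why a lookup through a 
registry breaks on the second failure once the first failure has unregistered 
the entry, while holding a direct reference does not.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model, not Flink code.
public class ExecutionLookupDemo {

    // Stand-in for a single execution attempt.
    static final class ToyExecution {
        final String id;
        boolean failed;

        ToyExecution(String id) {
            this.id = id;
        }
    }

    // Stand-in for a graph that tracks its current executions by id.
    static final class ToyGraph {
        final Map<String, ToyExecution> currentExecutions = new HashMap<>();

        ToyExecution register(String id) {
            ToyExecution execution = new ToyExecution(id);
            currentExecutions.put(id, execution);
            return execution;
        }

        // Failing an execution also unregisters it from the graph.
        void fail(ToyExecution execution) {
            execution.failed = true;
            currentExecutions.remove(execution.id);
        }
    }

    // Looks the execution up in the registry; returns false if it is gone.
    static boolean failViaRegistry(ToyGraph graph, String id) {
        ToyExecution execution = graph.currentExecutions.get(id);
        if (execution == null) {
            return false; // already unregistered by an earlier failure
        }
        graph.fail(execution);
        return true;
    }

    // Uses a reference held by the caller; repeated failures are harmless.
    static boolean failViaReference(ToyGraph graph, ToyExecution execution) {
        graph.fail(execution);
        return true;
    }

    public static void main(String[] args) {
        ToyGraph graph = new ToyGraph();
        graph.register("attempt-1");
        System.out.println(failViaRegistry(graph, "attempt-1")); // first failure
        System.out.println(failViaRegistry(graph, "attempt-1")); // lookup misses

        ToyExecution held = graph.register("attempt-2");
        System.out.println(failViaReference(graph, held));
        System.out.println(failViaReference(graph, held)); // still handled
    }
}
```

In the toy model the missed lookup simply returns false; in the real scheduler 
the equivalent miss is what ends up as an unexpected error.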


> RemoveCachedShuffleDescriptorTest#testRemoveOffloadedCacheForPointwiseEdgeAfterFailover
>  causes fatal error on CI
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-28392
>                 URL: https://issues.apache.org/jira/browse/FLINK-28392
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0
>            Reporter: Martijn Visser
>            Assignee: Chesnay Schepler
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.16.0
>
>
> {code:java}
> Jul 05 03:30:03 [ERROR] Error occurred in starting fork, check output in log
> Jul 05 03:30:03 [ERROR] Process Exit Code: 239
> Jul 05 03:30:03 [ERROR] Crashed tests:
> Jul 05 03:30:03 [ERROR] 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategyTest
> Jul 05 03:30:03 [ERROR] 
> org.apache.maven.surefire.booter.SurefireBooterForkException: 
> ExecutionException The forked VM terminated without properly saying goodbye. 
> VM crash or System.exit called?
> Jul 05 03:30:03 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xms256m -Xmx768m 
> -jar 
> /__w/1/s/flink-runtime/target/surefire/surefirebooter4932865857415988980.jar 
> /__w/1/s/flink-runtime/target/surefire 2022-07-05T03-23-25_404-jvmRun1 
> surefire8916732512419442726tmp surefire_2130262314165063415tmp
> Jul 05 03:30:03 [ERROR] Error occurred in starting fork, check output in log
> Jul 05 03:30:03 [ERROR] Process Exit Code: 239
> Jul 05 03:30:03 [ERROR] Crashed tests:
> Jul 05 03:30:03 [ERROR] 
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategyTest
> Jul 05 03:30:03 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:532)
> Jul 05 03:30:03 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:405)
> Jul 05 03:30:03 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:321)
> Jul 05 03:30:03 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:266)
> Jul 05 03:30:03 [ERROR] at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1314)
> Jul 05 03:30:03 [ERROR] at 
> org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1159)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37602&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=8147



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
