[ https://issues.apache.org/jira/browse/SPARK-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin updated SPARK-17696:
-----------------------------------
    Affects Version/s:     (was: 2.0.0)

> Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-17696
>                 URL: https://issues.apache.org/jira/browse/SPARK-17696
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> There's a race in the shutdown path of CoarseGrainedExecutorBackend that may
> lead to the process exiting with the wrong status. When the race triggers,
> you can see things like this in the driver logs in yarn-cluster mode:
> {noformat}
> 14:38:20,114 [Driver] INFO org.apache.spark.SparkContext - Successfully
> stopped SparkContext
> {noformat}
> And later:
> {noformat}
> 14:38:22,455 [Reporter] WARN org.apache.spark.deploy.yarn.YarnAllocator -
> Container marked as failed: container_1470951093505_0001_01_000002 on host:
> xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
> Container id: container_1470951093505_0001_01_000002
> Exit code: 1
> {noformat}
> This happens because the user class is still running after the SparkContext
> is shut down, so the YarnAllocator instance is alive long enough to fetch
> the exit status of the container. If the race is triggered, the container
> exits with the wrong status. In this case, enough containers hit the race
> that the application ended up failing due to too many container failures,
> even though the app would probably have succeeded otherwise.
> The race is as follows:
> - CoarseGrainedExecutorBackend receives a StopExecutor message.
> - Before it can enqueue a "Shutdown" message, the socket is disconnected and
> NettyRpcEnv enqueues a "RemoteProcessDisconnected" message.
> - "RemoteProcessDisconnected" is processed first, and calls "System.exit"
> with the wrong exit code for this case.
> You can see in the executor logs that both messages are processed:
> {noformat}
> 14:38:20,093 [dispatcher-event-loop-9] INFO
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a
> shutdown
> 14:38:20,286 [dispatcher-event-loop-9] ERROR
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988
> disassociated! Shutting down.
> {noformat}
> The code needs to avoid this situation by ignoring the disconnect event if
> it's already shutting down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
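The fix the description calls for — ignoring the disconnect event when a driver-commanded shutdown is already in progress — can be illustrated with a small "stopping" flag. The sketch below is a hypothetical simplification, not the actual Spark patch: the class name, handler names, and the `exitCode` field standing in for `System.exit` are all invented for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the race fix: remember that a StopExecutor
// message was received, so a later RemoteProcessDisconnected event does
// not report the unexpected-disconnect exit code.
class ExecutorBackendSketch {
    private final AtomicBoolean stopping = new AtomicBoolean(false);
    // Stand-in for System.exit: records the first exit code chosen (-1 = none yet).
    final AtomicInteger exitCode = new AtomicInteger(-1);

    // Handler for the driver's StopExecutor message.
    void onStopExecutor() {
        stopping.set(true);
        exitCode.compareAndSet(-1, 0); // clean, driver-commanded shutdown
    }

    // Handler for the disconnect event enqueued by the RPC layer.
    void onDisconnected() {
        if (stopping.get()) {
            return; // already shutting down: ignore the disconnect
        }
        exitCode.compareAndSet(-1, 1); // driver lost unexpectedly
    }
}

public class Demo {
    public static void main(String[] args) {
        // Racy ordering from the bug report: StopExecutor is received,
        // then the disconnect fires before shutdown completes.
        ExecutorBackendSketch b = new ExecutorBackendSketch();
        b.onStopExecutor();
        b.onDisconnected();
        System.out.println(b.exitCode.get()); // 0: disconnect was ignored

        // A genuine driver loss, with no prior StopExecutor, still exits 1.
        ExecutorBackendSketch b2 = new ExecutorBackendSketch();
        b2.onDisconnected();
        System.out.println(b2.exitCode.get()); // 1
    }
}
```

The flag must be set before the dispatcher can process the disconnect event; in a single-threaded dispatch loop (as in the logs above, both messages run on `dispatcher-event-loop-9`) setting it inside the StopExecutor handler is sufficient.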