[ 
https://issues.apache.org/jira/browse/SPARK-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-17696:
-----------------------------------
    Affects Version/s:     (was: 2.0.0)

> Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-17696
>                 URL: https://issues.apache.org/jira/browse/SPARK-17696
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> There's a race in the shutdown path of CoarseGrainedExecutorBackend that may 
> lead to the process exiting with the wrong status. When the race triggers, 
> you can see things like this in the driver logs in yarn-cluster mode:
> {noformat}
> 14:38:20,114 [Driver] INFO  org.apache.spark.SparkContext - Successfully 
> stopped SparkContext
> {noformat}
> And later:
> {noformat}
> 14:38:22,455 [Reporter] WARN  org.apache.spark.deploy.yarn.YarnAllocator - 
> Container marked as failed: container_1470951093505_0001_01_000002 on host: 
> xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
> Container id: container_1470951093505_0001_01_000002
> Exit code: 1
> {noformat}
> The warning is visible because the user class is still running after the 
> SparkContext is shut down, so the YarnAllocator instance is alive for long 
> enough to fetch the exit status of the container. If the race is triggered, 
> the container exits with the wrong status. In the case observed here, enough 
> containers hit the race that the application ended up failing due to too 
> many container failures, even though the app would probably have succeeded 
> otherwise.
> The race is as follows:
> - CoarseGrainedExecutorBackend receives a StopExecutor
> - Before it can enqueue a "Shutdown" message, the socket is disconnected and 
> NettyRpcEnv enqueues a "RemoteProcessDisconnected" message
> - "RemoteProcessDisconnected" is processed first, and calls "System.exit" 
> with the wrong exit code for this case.
> You can see that in the executor logs: both messages are being processed.
> {noformat}
> 14:38:20,093 [dispatcher-event-loop-9] INFO  
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a 
> shutdown
> 14:38:20,286 [dispatcher-event-loop-9] ERROR 
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988 
> disassociated! Shutting down.
> {noformat}
> The code needs to avoid this situation by ignoring the disconnect event if 
> it's already shutting down.
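
The race and the proposed guard can be illustrated with a small, deterministic toy model. This is only a sketch under stated assumptions, not Spark's actual code: the class BackendSketch, the function run_race, the "stopping" flag, and the use of 0 as the clean exit code are all hypothetical names and values chosen for illustration. Only the message names and the failing exit status 1 come from the description above.

```python
from collections import deque

# Message names taken from the issue description.
STOP_EXECUTOR = "StopExecutor"
SHUTDOWN = "Shutdown"
DISCONNECTED = "RemoteProcessDisconnected"


class BackendSketch:
    """Hypothetical stand-in for the executor backend's message handling."""

    def __init__(self, guard_enabled):
        self.guard_enabled = guard_enabled
        self.stopping = False      # assumed flag: set once a stop was requested
        self.inbox = deque()       # FIFO queue of pending messages
        self.exit_code = None      # stands in for System.exit(code)

    def handle(self, message, before_shutdown_enqueue=None):
        if message == STOP_EXECUTOR:
            self.stopping = True
            if before_shutdown_enqueue is not None:
                # Window of the race: something else is enqueued first.
                before_shutdown_enqueue()
            self.inbox.append(SHUTDOWN)
        elif message == SHUTDOWN:
            self.exit_code = 0     # assumed clean exit status
        elif message == DISCONNECTED:
            if self.guard_enabled and self.stopping:
                return             # expected disconnect during shutdown: ignore
            self.exit_code = 1     # failure status, as seen in the YARN log


def run_race(guard_enabled):
    """Replay the interleaving from the description and return the exit code."""
    backend = BackendSketch(guard_enabled)
    # The socket drops after StopExecutor arrives but before "Shutdown" is queued.
    backend.handle(
        STOP_EXECUTOR,
        before_shutdown_enqueue=lambda: backend.inbox.append(DISCONNECTED),
    )
    while backend.inbox and backend.exit_code is None:
        backend.handle(backend.inbox.popleft())
    return backend.exit_code


print(run_race(guard_enabled=False))
print(run_race(guard_enabled=True))
```

Without the guard, the disconnect message is handled first and decides the exit status; with the guard, it is skipped once a stop has been requested, so the queued "Shutdown" message determines the status instead.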



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

