[ https://issues.apache.org/jira/browse/SPARK-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shixiong Zhu resolved SPARK-17696.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.6.3, 2.0.0
> Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status
> ---------------------------------------------------------------------------
>
> Key: SPARK-17696
> URL: https://issues.apache.org/jira/browse/SPARK-17696
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, YARN
> Affects Versions: 1.6.0
> Reporter: Marcelo Vanzin
> Priority: Minor
> Fix For: 1.6.3, 2.0.0
>
>
> There's a race in the shutdown path of CoarseGrainedExecutorBackend that may
> lead to the process exiting with the wrong status. When the race triggers,
> you can see things like this in the driver logs in yarn-cluster mode:
> {noformat}
> 14:38:20,114 [Driver] INFO org.apache.spark.SparkContext - Successfully
> stopped SparkContext
> {noformat}
> And later:
> {noformat}
> 14:38:22,455 [Reporter] WARN org.apache.spark.deploy.yarn.YarnAllocator -
> Container marked as failed: container_1470951093505_0001_01_000002 on host:
> xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
> Container id: container_1470951093505_0001_01_000002
> Exit code: 1
> {noformat}
> This happens because the user class is still running after the SparkContext
> is shut down, so the YarnAllocator instance is alive for long enough to fetch
> the exit status of the container. If the race is triggered, the container
> exits with the wrong status. In this case, enough containers hit the race
> that the application ended up failing due to too many container failures,
> even though the app would probably have succeeded otherwise.
> The race is as follows:
> - CoarseGrainedExecutorBackend receives a StopExecutor
> - Before it can enqueue a "Shutdown" message, the socket is disconnected and
> NettyRpcEnv enqueues a "RemoteProcessDisconnected" message
> - "RemoteProcessDisconnected" is processed first, and calls "System.exit"
> with the wrong exit code for this case.
> You can see that in the executor logs: both messages are being processed.
> {noformat}
> 14:38:20,093 [dispatcher-event-loop-9] INFO
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a
> shutdown
> 14:38:20,286 [dispatcher-event-loop-9] ERROR
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988
> disassociated! Shutting down.
> {noformat}
> The code needs to avoid this situation by ignoring the disconnect event if
> it's already shutting down.
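
The ordering described above, and the suggested guard, can be illustrated with a small standalone sketch. This is not Spark's code or the actual patch: the class ShutdownGuardSketch, its handler names, and the exit code constants are hypothetical, and the sketch assumes a clean shutdown should report exit code 0. It records exit codes in a list instead of calling System.exit so that it can be run and inspected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for an executor backend; not Spark's actual code. */
public class ShutdownGuardSketch {
    static final int EXIT_OK = 0;       // assumed status for a clean shutdown
    static final int EXIT_FAILURE = 1;  // status seen in the driver log above

    // Set as soon as the stop request arrives, before any follow-up message is queued.
    private final AtomicBoolean stopping = new AtomicBoolean(false);
    private final boolean guardEnabled;
    // Exit codes are recorded here rather than passed to System.exit.
    final List<Integer> exitCodes = new ArrayList<>();

    ShutdownGuardSketch(boolean guardEnabled) {
        this.guardEnabled = guardEnabled;
    }

    /** The driver asked the executor to stop. */
    void onStopExecutor() {
        stopping.set(true);
    }

    /** The connection to the driver was lost. */
    void onDisconnected() {
        if (guardEnabled && stopping.get()) {
            return; // a disconnect is expected while shutting down, so ignore it
        }
        exitCodes.add(EXIT_FAILURE);
    }

    /** The queued shutdown message is finally processed. */
    void onShutdown() {
        exitCodes.add(EXIT_OK);
    }

    /** Replays the ordering from the report; the first recorded code is the one a real process would exit with. */
    static int replayRace(boolean guardEnabled) {
        ShutdownGuardSketch backend = new ShutdownGuardSketch(guardEnabled);
        backend.onStopExecutor();
        backend.onDisconnected(); // handled before the shutdown message
        backend.onShutdown();
        return backend.exitCodes.get(0);
    }

    public static void main(String[] args) {
        System.out.println("unguarded exit code: " + replayRace(false));
        System.out.println("guarded exit code: " + replayRace(true));
    }
}
```

Without the guard, the replay ends with the failure status because the disconnect handler runs first; with the guard, the disconnect is ignored and the shutdown handler determines the status.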