[ 
https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581038#comment-14581038
 ] 

Mridul Muralidharan commented on SPARK-8297:
--------------------------------------------

Spark on Mesos handles this situation by calling removeExecutor() on the 
scheduler backend; the YARN module does not.
I have this fixed locally but, unfortunately, do not have the bandwidth to 
shepherd a patch.

The fix is simple: replicate what is done in 
CoarseMesosSchedulerBackend.slaveLost().
Essentially:
a) Maintain a mapping from container-id to executor-id in YarnAllocator 
(consistent with, and the inverse of, executorIdToContainer).
b) Propagate the scheduler backend to YarnAllocator when 
YarnClusterScheduler.postCommitHook is called.
c) In processCompletedContainers, if the container is not in 
releasedContainers, invoke backend.removeExecutor(executorId, msg) to notify 
the backend that the executor did not exit gracefully or as expected.
d) Remove the mappings from containerIdToExecutorId and executorIdToContainer 
in processCompletedContainers (the latter also fixes a memory leak in 
YarnAllocator, by the way).
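
Steps a) to d) could be sketched roughly as follows. This is a minimal, 
standalone illustration, not the actual patch: SchedulerBackendLike, 
AllocatorSketch, setBackend, registerExecutor, releaseContainer and 
trackedExecutors are hypothetical names, container ids are modelled as plain 
strings, and the real YarnAllocator signatures differ.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the scheduler backend; only removeExecutor is modelled.
trait SchedulerBackendLike {
  def removeExecutor(executorId: String, reason: String): Unit
}

// Simplified stand-in for YarnAllocator's bookkeeping (assumption: ids are strings).
class AllocatorSketch {
  private val executorIdToContainer = mutable.HashMap[String, String]()
  // (a) inverse mapping, kept consistent with executorIdToContainer
  private val containerIdToExecutorId = mutable.HashMap[String, String]()
  private val releasedContainers = mutable.HashSet[String]()
  private var backend: Option[SchedulerBackendLike] = None

  // (b) the backend is handed to the allocator once it is available
  def setBackend(b: SchedulerBackendLike): Unit = { backend = Some(b) }

  def registerExecutor(executorId: String, containerId: String): Unit = {
    executorIdToContainer(executorId) = containerId
    containerIdToExecutorId(containerId) = executorId
  }

  // Marks a container as released on purpose, so its completion is expected.
  def releaseContainer(containerId: String): Unit = {
    releasedContainers += containerId
  }

  // Each completed container is given as (containerId, diagnostics).
  def processCompletedContainers(completed: Seq[(String, String)]): Unit = {
    for ((containerId, diagnostics) <- completed) {
      val wasReleased = releasedContainers.remove(containerId)
      // (d) drop both mappings so neither map grows without bound
      containerIdToExecutorId.remove(containerId).foreach { executorId =>
        executorIdToContainer.remove(executorId)
        // (c) notify the backend only when the exit was not requested
        if (!wasReleased) {
          backend.foreach(_.removeExecutor(
            executorId, s"Container $containerId exited unexpectedly: $diagnostics"))
        }
      }
    }
  }

  def trackedExecutors: Int = executorIdToContainer.size
}
```

In this sketch a container that was released deliberately is cleaned up 
silently, while one that disappears (for example because its node crashed) 
triggers removeExecutor on the backend.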


If no one picks this up, I can fix it later in the 1.5 release cycle.

> Scheduler backend is not notified in case node fails in YARN
> ------------------------------------------------------------
>
>                 Key: SPARK-8297
>                 URL: https://issues.apache.org/jira/browse/SPARK-8297
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.2.2, 1.3.1, 1.4.1
>         Environment: Spark on yarn - both client and cluster mode.
>            Reporter: Mridul Muralidharan
>            Priority: Critical
>
> When a node crashes, YARN detects the failure and notifies Spark, but this 
> information is not propagated to the scheduler backend (unlike in Mesos 
> mode, for example).
> This results in repeated re-execution of stages (due to 
> FetchFailedException on the shuffle side) and, finally, in application 
> failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
