[ 
https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387846#comment-16387846
 ] 

Thomas Graves commented on SPARK-16630:
---------------------------------------

Yes, something along these lines is what I was thinking. We would want a 
configurable number of failures (perhaps we can reuse one of the existing 
settings, but I would need to think about that more) at which point we would 
blacklist the node due to executor launch failures, and we could have a timeout 
after which we could retry it.  We also want to take small clusters into 
account and perhaps stop blacklisting if a certain percentage of the cluster is 
already blacklisted.
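
To make the three pieces above concrete (failure threshold, retry timeout, and 
a cap on how much of the cluster may be blacklisted), here is a rough Python 
sketch. It is not Spark code and not a proposed patch; the class name, the 
parameter names and the default values are all hypothetical and only 
illustrate how the pieces could fit together.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class LaunchFailureBlacklist:
    """Illustrative only: every name and default here is an assumption."""
    max_failures: int = 2                    # launch failures before blacklisting
    retry_timeout_s: float = 3600.0          # how long a node stays blacklisted
    max_blacklisted_fraction: float = 0.5    # cap to protect small clusters
    clock: Callable[[], float] = time.monotonic
    _failures: Dict[str, int] = field(default_factory=dict, init=False)
    _expiry: Dict[str, float] = field(default_factory=dict, init=False)

    def _expire(self) -> None:
        # Drop nodes whose timeout has passed so they can be retried.
        now = self.clock()
        for node in [n for n, t in self._expiry.items() if t <= now]:
            del self._expiry[node]
            self._failures.pop(node, None)

    def record_launch_failure(self, node: str, cluster_size: int) -> bool:
        """Record one failed executor launch; return True if node is blacklisted."""
        self._expire()
        if node in self._expiry:
            return True
        self._failures[node] = self._failures.get(node, 0) + 1
        if self._failures[node] < self.max_failures:
            return False
        # Stop blacklisting once too much of the cluster would be excluded.
        if len(self._expiry) + 1 > self.max_blacklisted_fraction * cluster_size:
            return False
        self._expiry[node] = self.clock() + self.retry_timeout_s
        return True

    def is_blacklisted(self, node: str) -> bool:
        self._expire()
        return node in self._expiry
```

In this sketch the allocator would call record_launch_failure() each time a 
container fails to start on a node and consult is_blacklisted() before 
requesting new containers there. Once the timeout passes, the node's failure 
count is reset and it becomes eligible again.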

> Blacklist a node if executors won't launch on it.
> -------------------------------------------------
>
>                 Key: SPARK-16630
>                 URL: https://issues.apache.org/jira/browse/SPARK-16630
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 1.6.2
>            Reporter: Thomas Graves
>            Priority: Major
>
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it.  For instance, the Spark external shuffle 
> handler might not have been loaded on it, or there may be some other hardware 
> or Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to 
> launch executors on that node, since continuing could cause us to hit our max 
> number of executor failures and then kill the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

