Imran Rashid commented on SPARK-16630:

I'd also take {{spark.yarn.max.executor.failures}} into account for figuring 
out a default.  In particular we'd want the default to be below that, so one 
bad node wouldn't kill the app.
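
As a minimal sketch of that constraint (the function name and the 
halve-the-budget rule are hypothetical, chosen for illustration, and are not 
existing Spark code), a per-node launch-failure threshold could be derived 
from the app-wide failure budget so that it always stays below it:

```python
def default_node_launch_failure_threshold(max_executor_failures: int) -> int:
    """Hypothetical default: failed launches tolerated on one node before
    that node is blacklisted.

    `max_executor_failures` stands in for the value of
    spark.yarn.max.executor.failures. The result is always strictly below
    it, so a single bad node cannot use up the whole budget by itself.
    """
    if max_executor_failures < 2:
        # With a budget of 1 there is no room for a lower per-node threshold.
        raise ValueError("max_executor_failures must be at least 2")
    # Assumption: spend at most half of the budget on one node, but allow
    # at least one failure and never reach the budget itself.
    return max(1, min(max_executor_failures - 1, max_executor_failures // 2))
```

With a budget of 10 this gives a threshold of 5; with a budget of 3 it gives 
1.  The halving rule is arbitrary; the only property that matters here is 
that the result stays below the app-wide limit.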

Does it make sense for this to be tied into the generic BlacklistTracker?  I 
guess all the interesting logic will be cluster specific so maybe not.

We also want to take into account small clusters and perhaps stop blacklisting 
if a certain percent of the cluster is already blacklisted.

I don't think that is possible -- is the size of the cluster exposed at all?  
You raise a good point though: we'd need some way to detect this, to avoid 
the app just sitting idle indefinitely.
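
If the cluster size were available, the "stop at a certain percent" check 
could look roughly like the sketch below.  This is only an illustration: 
`cluster_size`, `max_fraction` and the function itself are assumptions, and 
as noted above it is not clear that the cluster size is exposed at all.

```python
def may_blacklist_another_node(blacklisted: int,
                               cluster_size: int,
                               max_fraction: float = 0.5) -> bool:
    """Return True if one more node can be blacklisted without pushing the
    blacklisted share of the cluster above `max_fraction`.

    All three parameters are hypothetical inputs; nothing here is read
    from Spark or YARN.
    """
    if cluster_size <= 0:
        # Unknown or empty cluster: be conservative and do not blacklist.
        return False
    return (blacklisted + 1) / cluster_size <= max_fraction
```

On a 2-node cluster with a 50% cap, the first bad node can be blacklisted 
but the second cannot, so the app keeps at least one node to launch 
executors on instead of sitting idle.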

> Blacklist a node if executors won't launch on it.
> -------------------------------------------------
>                 Key: SPARK-16630
>                 URL: https://issues.apache.org/jira/browse/SPARK-16630
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 1.6.2
>            Reporter: Thomas Graves
>            Priority: Major
> On YARN, it's possible that a node is messed up or misconfigured such that a 
> container won't launch on it, for instance if the Spark external shuffle 
> handler didn't get loaded on it, or maybe it's just some other hardware issue 
> or Hadoop configuration issue. 
> It would be nice if we could recognize this happening and stop trying to 
> launch executors on it, since that could end up causing us to hit our max 
> number of executor failures and then kill the job.
