attilapiros commented on issue #26343: [SPARK-29683][YARN] Job will fail due to executor failures all available nodes are blacklisted
URL: https://github.com/apache/spark/pull/26343#issuecomment-590883307

@uncleGen I have checked this on a cluster, and I would not use `spark.blacklist.waiting.millis` for every case where there are no more nodes to allocate on, as that would mix the following two cases:

- there are cluster nodes but all of them are blacklisted by Spark (no waiting is needed, we can stop right away)
- there are no available nodes at all because of an RM failover (here we could wait)

So what about using the timer only when there are no reported available nodes? **Or not using the timer at all.** Stopping at a YARN RM failover can be avoided by this small change:

```
- def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+ def isAllNodeBlacklisted: Boolean =
+   numClusterNodes != 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
```

I know that in this case we would wait unconditionally for the RM (as it was before SPARK-16630), but I think this is an operational issue on the YARN side and we could keep the old behavior.

The change was tested by stopping/starting the RM daemon manually on each RM node (as auto-failover was enabled on this cluster):

```
$ yarn --daemon stop resourcemanager
$ yarn --daemon start resourcemanager
```

cc @squito
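To make the distinction between the two cases above concrete, here is a minimal standalone sketch (hypothetical names, not the actual `YarnAllocatorBlacklistTracker` code) of how the guarded check separates the RM-failover case from the genuinely all-blacklisted case:

```
// Minimal standalone sketch with assumed names, not the actual Spark allocator code.
// During an RM failover the reported cluster size can drop to 0; without the guard,
// 0 >= 0 would incorrectly report "all nodes blacklisted" and stop the application.
object BlacklistCheckSketch {
  def isAllNodeBlacklisted(numClusterNodes: Int, blacklistedCount: Int): Boolean =
    numClusterNodes != 0 && blacklistedCount >= numClusterNodes

  def main(args: Array[String]): Unit = {
    // RM failover: no nodes reported at all -> not "all blacklisted", keep waiting for the RM
    println(isAllNodeBlacklisted(numClusterNodes = 0, blacklistedCount = 0)) // false
    // Healthy cluster where Spark has blacklisted every node -> stop right away
    println(isAllNodeBlacklisted(numClusterNodes = 3, blacklistedCount = 3)) // true
    // Healthy cluster with some usable nodes left
    println(isAllNodeBlacklisted(numClusterNodes = 3, blacklistedCount = 1)) // false
  }
}
```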
