attilapiros commented on issue #26343: [SPARK-29683][YARN] Job will fail due to 
executor failures all available nodes are blacklisted
URL: https://github.com/apache/spark/pull/26343#issuecomment-590883307
 
 
   @uncleGen I have checked this on a cluster and I would not use 
`spark.blacklist.waiting.millis` in every case where there are no more nodes to 
allocate on, as this would mix the following two cases:
   - there are cluster nodes but all of them are blacklisted by Spark 
(waiting is not needed, we can stop right away)
   - there are no available nodes at all because of RM failover (we could wait) 
   
   So what about only using the timer when there are no reported available 
nodes? 
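   
   To make that distinction concrete, here is a minimal sketch of such a policy 
(not the actual allocator code; `shouldFailBecauseOfBlacklisting`, `blacklistedNodes`, 
`noNodesSince` and `blacklistWaitingMillis` are assumed names mirroring the 
discussion above):
   
   ```
   // Minimal sketch of the proposed policy; names are assumed for illustration,
   // this is not the actual YarnAllocatorBlacklistTracker code.
   def shouldFailBecauseOfBlacklisting(
       numClusterNodes: Int,
       blacklistedNodes: Set[String],
       noNodesSince: Option[Long],      // when we first saw 0 reported nodes
       blacklistWaitingMillis: Long,
       now: Long): Boolean = {
     if (numClusterNodes == 0) {
       // no nodes reported at all (e.g. RM failover): only fail after the
       // configured waiting period has elapsed
       noNodesSince.exists(start => now - start > blacklistWaitingMillis)
     } else {
       // nodes are reported but every one of them is blacklisted by Spark:
       // waiting will not help, we can fail right away
       blacklistedNodes.size >= numClusterNodes
     }
   }
   ```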
   
   **Or not using the timer at all.** Stopping the application at a YARN RM 
failover can be avoided by this small change:
   
   ```
   -  def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
   +  def isAllNodeBlacklisted: Boolean =
   +    numClusterNodes != 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
   ```
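   
   Just to illustrate the effect of the guard (a standalone snippet, not the 
real tracker, with made-up inputs):
   
   ```
   // Standalone illustration of the guarded predicate with assumed inputs.
   def isAllNodeBlacklisted(numClusterNodes: Int, blacklisted: Set[String]): Boolean =
     numClusterNodes != 0 && blacklisted.size >= numClusterNodes
   
   // During RM failover no nodes are reported, so we keep waiting for the RM:
   assert(!isAllNodeBlacklisted(0, Set.empty))
   // With nodes reported and all of them blacklisted we still stop right away:
   assert(isAllNodeBlacklisted(2, Set("node1", "node2")))
   ```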
   
   I know in this case we would wait unconditionally for the RM (as it was 
before SPARK-16630), but I think this is an operational issue on the YARN side 
and we could keep this old behavior.  
   
   This change was tested by manually stopping/starting the RM daemon on each RM 
node (as auto-failover was enabled on this cluster):
   
   ```
   $ yarn --daemon stop resourcemanager
   $ yarn --daemon start resourcemanager
   ```
   
   cc @squito 
