sungpeo opened a new pull request #35089:
URL: https://github.com/apache/spark/pull/35089


   [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
   
   ### What changes were proposed in this pull request?
   
   Improve the check that determines whether all NodeManagers are actually 
blacklisted.
   
   ### Why are the changes needed?
   
   When the AM is out of sync with the ResourceManager, or the RM has trouble 
reporting the current number of available NMs, something like the following 
happens:
   ...
   20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of 
File Exception between local host is: "client.zyx.com/x.x.x.124"; destination 
host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see:  
http://wiki.apache.org/hadoop/EOFException, while invoking 
ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover 
immediately.
   ...
   20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with 
ResourceManager, hence resyncing.
   ...
   
   The Spark job then suddenly fails with an all-nodes-blacklisted status:
   ...
   20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, 
exitCode: 11, (reason: Due to executor failures all available nodes are 
blacklisted)
   ...
   
   However, currentBlacklistedYarnNodes actually contains no blacklisted nodes, 
and no blacklisting message is logged from:
   
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119
   
   isAllNodeBlacklisted should return true only when numClusterNodes > 0 AND 
'currentBlacklistedYarnNodes.size >= numClusterNodes'. Without the first 
condition, a reported node count of 0 lets an empty blacklist satisfy the size 
comparison trivially.
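
The condition above can be sketched as a small stand-alone Scala function. This is an illustrative model only, not the code in YarnAllocatorBlacklistTracker: the object name `BlacklistCheckSketch` and the explicit parameters are assumptions made so the example is self-contained, while `numClusterNodes` and `currentBlacklistedYarnNodes` mirror the names used in this description.

```scala
// Hypothetical stand-alone sketch; the real tracker holds these values as
// internal state rather than taking them as parameters.
object BlacklistCheckSketch {

  /** Returns true only when the RM reports at least one cluster node
    * and the blacklist covers every reported node.
    */
  def isAllNodeBlacklisted(
      numClusterNodes: Int,
      currentBlacklistedYarnNodes: Set[String]): Boolean = {
    // A non-positive node count means there is no usable cluster size to
    // compare against, so it is never treated as "all nodes blacklisted".
    numClusterNodes > 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
  }

  def main(args: Array[String]): Unit = {
    // RM reports no nodes and nothing is blacklisted: the failing scenario.
    println(isAllNodeBlacklisted(0, Set.empty))
    // Two nodes reported, both blacklisted.
    println(isAllNodeBlacklisted(2, Set("host1", "host2")))
    // Two nodes reported, only one blacklisted.
    println(isAllNodeBlacklisted(2, Set("host1")))
  }
}
```

With a size comparison alone, the first case would evaluate `0 >= 0` and report that every node is blacklisted; the added guard makes it return false.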
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   This is a minor change; no tests were added or modified.
   
   Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.
   
   Authored-by: Yuexin Zhang <[email protected]>
   Signed-off-by: Sean Owen <[email protected]>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


