sungpeo opened a new pull request #35089: URL: https://github.com/apache/spark/pull/35089
[SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

### What changes were proposed in this pull request?

Improve the logic that checks whether all node managers are really blacklisted.

### Why are the changes needed?

I observed that when the AM is out of sync with the ResourceManager, or the RM has trouble reporting the current number of available NMs, something like the following happens:

...
20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
...
20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.
...

The Spark job then suddenly runs into the AllNodeBlacklisted state:

...
20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
...

But there are actually no blacklisted nodes in currentBlacklistedYarnNodes, and I do not see any blacklisting message from: https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119

We should only return isAllNodeBlacklisted = true when numClusterNodes > 0 AND 'currentBlacklistedYarnNodes.size >= numClusterNodes'.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A minor change. No tests were changed.

Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.

Authored-by: Yuexin Zhang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
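The condition described above can be sketched in isolation. This is a minimal illustration, not the code in YarnAllocatorBlacklistTracker: the `BlacklistCheckSketch` object, the `BlacklistState` case class, and the `sizeComparisonOnly` helper are hypothetical names introduced here, and only `numClusterNodes`, `currentBlacklistedYarnNodes`, and `isAllNodeBlacklisted` come from the description. It assumes the false report arises because a size comparison with no guard is trivially true when the RM reports zero nodes and the blacklist is empty.

```scala
// Hypothetical, simplified stand-in for the tracker state; not Spark's actual class.
object BlacklistCheckSketch {

  final case class BlacklistState(
      numClusterNodes: Int,                     // node count as last reported by the RM
      currentBlacklistedYarnNodes: Set[String]  // hosts currently blacklisted
  ) {
    // The size comparison on its own (assumed to be what triggers the false report):
    // with 0 reported nodes and an empty blacklist, 0 >= 0 holds.
    def sizeComparisonOnly: Boolean =
      currentBlacklistedYarnNodes.size >= numClusterNodes

    // Guarded check as stated in the description: require at least one reported node.
    def isAllNodeBlacklisted: Boolean =
      numClusterNodes > 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
  }

  def main(args: Array[String]): Unit = {
    // RM is out of sync and reports no nodes; nothing is blacklisted.
    val rmOutOfSync = BlacklistState(numClusterNodes = 0, currentBlacklistedYarnNodes = Set.empty)
    println(s"size comparison only: ${rmOutOfSync.sizeComparisonOnly}")   // true
    println(s"guarded check:        ${rmOutOfSync.isAllNodeBlacklisted}") // false

    // Every reported node really is blacklisted.
    val allBad = BlacklistState(3, Set("nm1", "nm2", "nm3"))
    println(s"all nodes blacklisted: ${allBad.isAllNodeBlacklisted}")     // true
  }
}
```

With the guard in place, a zero node count is treated as "no information from the RM" rather than as "every node is blacklisted", so the application is not failed on a transient RM problem.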
