This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new e70df2c [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM
is having issue
e70df2c is described below
commit e70df2cea46f71461d8d401a420e946f999862c1
Author: Yuexin Zhang <[email protected]>
AuthorDate: Mon Jun 1 09:46:18 2020 -0500
[SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having
issue
### What changes were proposed in this pull request?
Improve the logic that checks whether all NodeManagers are really blacklisted.
### Why are the changes needed?
I observed that when the AM is out of sync with the ResourceManager, or the RM
has trouble reporting the current number of available NMs, something like the
following happens:
...
20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of
File Exception between local host is: "client.zyx.com/x.x.x.124"; destination
host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see:
http://wiki.apache.org/hadoop/EOFException, while invoking
ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover
immediately.
...
20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync
with ResourceManager, hence resyncing.
...
The Spark job then suddenly runs into the AllNodeBlacklisted state:
...
20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED,
exitCode: 11, (reason: Due to executor failures all available nodes are
blacklisted)
...
In fact, currentBlacklistedYarnNodes contains no blacklisted nodes, and I do
not see any blacklisting message logged from:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119
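isAllNodeBlacklisted should return true only when numClusterNodes > 0 AND
'currentBlacklistedYarnNodes.size >= numClusterNodes'. Otherwise, when the RM
reports 0 nodes, the size comparison is trivially true (0 >= 0) even though
nothing has been blacklisted.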
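The difference between the old and new checks can be sketched in isolation. This is a minimal illustration, not the actual Spark class: `BlacklistCheckSketch`, its constructor parameters, and `oldIsAllNodeBlacklisted` are hypothetical names introduced here, and the real method additionally logs a warning when no nodes are reported.

```scala
// Hypothetical stand-in for the tracker state; not the actual Spark class.
final case class BlacklistCheckSketch(
    currentBlacklistedYarnNodes: Set[String],
    numClusterNodes: Int) {

  // Check as it was before this change: trivially true when numClusterNodes is 0.
  def oldIsAllNodeBlacklisted: Boolean =
    currentBlacklistedYarnNodes.size >= numClusterNodes

  // Check after this change: a non-positive node count never counts as
  // "all nodes blacklisted" (the real method also logs a warning here).
  def isAllNodeBlacklisted: Boolean =
    if (numClusterNodes <= 0) false
    else currentBlacklistedYarnNodes.size >= numClusterNodes
}

object BlacklistCheckDemo {
  def main(args: Array[String]): Unit = {
    // Assumed scenario: the RM temporarily reports 0 nodes, nothing is blacklisted.
    val rmHavingIssue = BlacklistCheckSketch(Set.empty, numClusterNodes = 0)
    println(rmHavingIssue.oldIsAllNodeBlacklisted) // prints "true" (false report)
    println(rmHavingIssue.isAllNodeBlacklisted)    // prints "false"

    // Both nodes of a 2-node cluster are blacklisted.
    val allBlacklisted = BlacklistCheckSketch(Set("node1", "node2"), numClusterNodes = 2)
    println(allBlacklisted.isAllNodeBlacklisted)   // prints "true"
  }
}
```

With a positive node count the two checks agree, so the change only affects the case where the RM reports no nodes.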
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a minor change; no tests were changed.
Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.
Authored-by: Yuexin Zhang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
---
.../apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
index fa8c961..339d371 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
@@ -103,7 +103,14 @@ private[spark] class YarnAllocatorBlacklistTracker(
     refreshBlacklistedNodes()
   }

-  def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+  def isAllNodeBlacklisted: Boolean = {
+    if (numClusterNodes <= 0) {
+      logWarning("No available nodes reported, please check Resource Manager.")
+      false
+    } else {
+      currentBlacklistedYarnNodes.size >= numClusterNodes
+    }
+  }

   private def refreshBlacklistedNodes(): Unit = {
     removeExpiredYarnBlacklistedNodes()