tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743114402
##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }
Review comment:
Yeah, unfortunately that final application status is what YARN really uses and advertises, and it really wants you to unregister.
How often does this really happen? It seems like if you are timing out talking to the RM very often, you have other problems on your cluster. Or does this happen on a rolling upgrade or something?
Note, if we were to do this, then the YARN final status also wouldn't match, because doesn't that show up as failed when it can't unregister? That could confuse the user.
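For reference, a minimal sketch of the unregister call in question; the setup around it is assumed, and this is not Spark's actual Client/AM code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.hadoop.yarn.client.api.AMRMClient

// Sketch only: YARN takes the application's advertised final status from this
// unregisterApplicationMaster call. If the AM never manages to unregister, the
// RM has to infer a status on its own, which is why the two can disagree.
object UnregisterSketch {
  def main(args: Array[String]): Unit = {
    val amClient = AMRMClient.createAMRMClient()
    amClient.init(new Configuration())
    amClient.start()
    // ... registerApplicationMaster, run containers, do the actual work ...
    amClient.unregisterApplicationMaster(
      FinalApplicationStatus.SUCCEEDED, // the status YARN will advertise
      "all work completed",             // diagnostics shown in the app report
      null)                             // tracking URL, optional
    amClient.stop()
  }
}
```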
##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }
Review comment:
Users complain because they miss an SLA or something? Retrying should be an acceptable thing sometimes when failures happen. But I get that it's annoying.
What is the timeout like in this case? Is it that the YARN call timed out but would have answered within a few seconds, or is it that YARN isn't going to respond for tens of minutes or hours?
The other thing we could possibly do is look at adding optional, configurable retry logic in Spark on top of the YARN client retry: when we really want to unregister, wait and retry in addition to the built-in Hadoop retry logic. That would delay the client knowing it's done, but it would NOT re-run the application and would keep YARN state consistent (a rough sketch is below). Obviously we would need others' input on this as well. This is one reason why I'm asking about when this happens and the exact circumstances, so that we can possibly come up with alternate solutions.
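A minimal sketch of that idea; `unregister` stands in for the real AMRMClient.unregisterApplicationMaster call, and the attempt count and wait would come from hypothetical, not-yet-existing Spark configs:

```scala
// Hypothetical sketch: Spark-side retry layered on top of the built-in
// Hadoop/YARN client retry. Nothing here is an existing Spark config or API.
object UnregisterRetrySketch {
  def retryUnregister(unregister: () => Unit, maxAttempts: Int, waitMs: Long): Unit = {
    var attempt = 1
    var done = false
    while (!done) {
      try {
        unregister()
        done = true
      } catch {
        case _: Exception if attempt < maxAttempts =>
          // Delays the client learning the app is finished, but keeps YARN's
          // advertised final status correct if a later attempt gets through.
          Thread.sleep(waitMs)
          attempt += 1
      }
    }
  }
}
```

On the last attempt the exception propagates, so the caller still sees the failure instead of it being silently swallowed.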
Like @mridulm, I'm not a fan of parsing the diagnostics.
You can also increase your RPC timeout for talking to the resource manager; an example is below.
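For instance, via Spark's pass-through for Hadoop configuration (these two YARN client settings exist in yarn-default.xml; the values here are just illustrative and should be tuned per cluster):

```
# how long the YARN client keeps retrying the RM connection, and how often
spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=900000
spark.hadoop.yarn.resourcemanager.connect.retry-interval.ms=30000
```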
The other way is to handle this outside of Spark: the application has the retry logic instead of Spark, where it can check, for instance, whether it wrote the expected output; if it's there, don't retry, otherwise retry (see the sketch below).
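A rough sketch of that external pattern; the `_SUCCESS` marker and the submit hook are illustrative assumptions, not something Spark mandates:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: the caller owns the retry, keyed off whether the expected output
// actually landed rather than off YARN's (possibly misleading) final status.
object ExternalRetrySketch {
  def outputExists(output: String): Boolean = {
    val marker = new Path(output, "_SUCCESS") // marker Hadoop/Spark jobs commonly write
    FileSystem.get(marker.toUri, new Configuration()).exists(marker)
  }

  def runWithRetry(submit: () => Unit, output: String, maxAttempts: Int): Unit = {
    var attempt = 0
    while (!outputExists(output) && attempt < maxAttempts) {
      attempt += 1
      submit() // e.g. shell out to spark-submit or use SparkLauncher (not shown)
    }
    if (!outputExists(output)) {
      sys.error(s"no $output/_SUCCESS after $maxAttempts attempts")
    }
  }
}
```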
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.