tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743858964
##########
File path:
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }
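As a standalone illustration of the string check the patch proposes (the object name `DiagnosticsCheck` and the sample messages are invented for this sketch): the diff assigns `amContainerSuccess` inside `foreach`, so each diagnostic message overwrites the previous result and only the last message decides the outcome. A scan with `exists` expresses the intended "any diagnostic shows the AM container exited cleanly" check without that last-write-wins behavior:

```scala
// Hypothetical, self-contained sketch; not the actual Client.scala code.
object DiagnosticsCheck {
  // True if ANY diagnostic message mentions the AM Container with exitCode: 0.
  // The diff's foreach-with-assignment keeps only the LAST message's result.
  def amContainerSucceeded(diags: Seq[String]): Boolean =
    diags.exists(err => err.contains("AM Container") && err.contains("exitCode: 0"))

  def main(args: Array[String]): Unit = {
    val diags = Seq(
      "AM Container for appattempt_0001 exited with exitCode: 0",
      "Shutdown hook called")
    println(DiagnosticsCheck.amContainerSucceeded(diags)) // prints "true"
  }
}
```

With the `foreach` version from the diff, the same input would report failure, because the second message ("Shutdown hook called") overwrites the flag set by the first.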
Review comment:
Users complain because they miss an SLA or something? Retry should be
acceptable sometimes when failures happen, but I get that it's annoying.
What is the timeout like in this case? Did the YARN call time out but would
have answered within a few seconds, or is YARN not going to respond for tens
of minutes or hours?
The other thing we could possibly do is add optional, configurable retry
logic in Spark on top of the YARN client retry: when we say we really want to
unregister, wait and retry in addition to the built-in Hadoop retry logic.
That would delay the client knowing it's done, but would NOT retry the
application and would keep YARN state consistent. Obviously we would need
others' input on this as well. This is one reason I'm asking when this happens
and under what exact circumstances, so that we could possibly come up with
alternate solutions.
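The Spark-side retry idea above could look roughly like the sketch below. Everything here is invented for illustration (`RetrySketch`, `withRetry`, the parameter names); it is not Spark's API, just a generic "retry the unregister call a configurable number of times with backoff, on top of whatever Hadoop already does" wrapper:

```scala
// Hypothetical sketch of configurable Spark-side retry on top of the
// built-in Hadoop RM retry logic; names and shape are invented.
object RetrySketch {
  // Run op, retrying up to maxAttempts times with linear backoff.
  // The last failure propagates to the caller.
  def withRetry[T](maxAttempts: Int, backoffMs: Long)(op: => T): T = {
    var attempt = 0
    while (true) {
      attempt += 1
      try {
        return op
      } catch {
        case _: Exception if attempt < maxAttempts =>
          Thread.sleep(backoffMs * attempt) // back off before the next attempt
      }
    }
    throw new IllegalStateException("unreachable")
  }
}
```

A caller would wrap only the unregister path, so the extra waiting delays the client's "done" signal without changing anything YARN sees.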
Like @mridulm, I'm not a fan of parsing the diagnostics.
You can also increase your RPC timeout for talking to the ResourceManager.
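For reference, the client-side knobs for how long the Hadoop client keeps trying to reach the ResourceManager are settings like the ones below (property names from yarn-default.xml; the values shown are illustrative, not recommendations):

```xml
<!-- Client-side yarn-site.xml; illustrative values only -->
<property>
  <!-- Total time the client keeps trying to connect to the RM -->
  <name>yarn.resourcemanager.connect.max-wait.ms</name>
  <value>900000</value>
</property>
<property>
  <!-- Interval between connection attempts to the RM -->
  <name>yarn.resourcemanager.connect.retry-interval.ms</name>
  <value>30000</value>
</property>
```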
The other way is to handle this outside of Spark: the application has the
retry logic instead of Spark, where it can check, for instance, whether it
wrote the expected output; if it's there, don't retry, otherwise retry.
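That application-side pattern could be sketched as below. The names (`SubmitWithCheck`, `runWithRetry`) and the local-filesystem check are invented for illustration; a real job would typically check an HDFS path or a `_SUCCESS` marker instead:

```scala
import java.nio.file.{Files, Paths}

// Hypothetical sketch of retry handled OUTSIDE Spark: resubmit only while
// the expected output is missing, so a spurious FAILED report after a
// successful run does not trigger a pointless rerun.
object SubmitWithCheck {
  def runWithRetry(outputPath: String, maxAttempts: Int)(submit: () => Unit): Boolean = {
    var attempt = 0
    while (attempt < maxAttempts && !Files.exists(Paths.get(outputPath))) {
      attempt += 1
      try submit()
      catch { case _: Exception => () } // a failed submit falls through to the output check
    }
    Files.exists(Paths.get(outputPath)) // success = the output is actually there
  }
}
```

The key point is that "success" is decided by the presence of the output, not by the status YARN reported, which sidesteps the diagnostics-parsing problem entirely.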
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]