tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743114402



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Yeah, unfortunately that final application status is what YARN really uses and advertises, and it really wants you to unregister.
   
   How often does this really happen? It seems like if you are frequently timing out talking to the RM, you have other problems on your cluster. Or does this happen during a rolling upgrade or something?
   
   Note, if we were to do this, then the YARN final status also wouldn't match, because doesn't that show up as failed when it can't unregister? That could confuse the user.
   
   

##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Users complain because they miss an SLA or something? A retry should sometimes be an acceptable thing when failures happen. But I get that it's annoying.
   
   What is the timeout like in this case? Is it that the YARN call timed out but would have answered within a few seconds, or is it that YARN isn't going to respond for tens of minutes or hours?
   
   The other thing we could possibly do is look at adding optional, configurable retry logic in Spark on top of the YARN client retry, where we say we really want to unregister, so we wait and retry in addition to the built-in Hadoop retry logic (see the sketch below). That would delay the client knowing it's done, but it would NOT retry the application and would keep the YARN state consistent. Obviously we would need input from others on this as well. This is one reason I'm asking about when this happens and the exact circumstances, so that we could possibly come up with alternate solutions.
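   
   Just to make that idea concrete, a rough sketch of what I mean (the helper and whatever config would feed `maxAttempts`/`waitMs` are purely hypothetical, not existing Spark config or API):
   
   ```scala
   // Rough sketch only: nothing here is an existing Spark API or configuration.
   object UnregisterRetrySketch {
     // Retry the unregister call ourselves, on top of the built-in Hadoop/YARN client retry.
     def unregisterWithRetry(maxAttempts: Int, waitMs: Long)(unregister: => Unit): Unit = {
       var attempt = 0
       var done = false
       while (!done) {
         try {
           unregister
           done = true
         } catch {
           case _: Exception if attempt < maxAttempts - 1 =>
             attempt += 1
             // Delays the client learning the app is done, but keeps the YARN final status consistent.
             Thread.sleep(waitMs)
           // On the last attempt the exception propagates and we fail as we do today.
         }
       }
     }
   }
   ```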
   
   Like @mridulm, I'm not a fan of parsing the diagnostics.
   
   You can also increase your RPC timeout for talking to the resource manager, e.g. something like the snippet below.
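   
   Hedged example: these are the standard YARN client connection settings, forwarded through the `spark.hadoop.*` prefix; double-check the exact property names and defaults for your Hadoop version.
   
   ```scala
   // Spark copies spark.hadoop.* entries into the Hadoop Configuration used by the YARN client.
   import org.apache.spark.SparkConf
   
   val conf = new SparkConf()
     // total time to keep retrying the ResourceManager before giving up
     .set("spark.hadoop.yarn.resourcemanager.connect.max-wait.ms", "900000")
     // wait between ResourceManager connection retries
     .set("spark.hadoop.yarn.resourcemanager.connect.retry-interval.ms", "30000")
   ```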
   
   The other way is to handle this outside of Spark: the application has the retry logic instead of Spark, where it can check, for instance, whether it wrote the expected output; if it's there, don't retry, otherwise retry (see the sketch below).
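   
   Roughly something like this, where the submit command, the output check, and the attempt count are all placeholders for whatever the application actually does (and you'd use the Hadoop FileSystem API instead of a local-path check if the output lives on HDFS):
   
   ```scala
   // Illustrative only: an external wrapper that re-submits the job and decides based on its output.
   import java.nio.file.{Files, Paths}
   import scala.sys.process._
   
   def runWithRetry(submitCmd: Seq[String], expectedOutput: String, maxAttempts: Int): Unit = {
     var attempt = 0
     while (!Files.exists(Paths.get(expectedOutput)) && attempt < maxAttempts) {
       attempt += 1
       // Launch spark-submit; don't trust the client exit status alone here, check the output instead.
       submitCmd.!
     }
     require(Files.exists(Paths.get(expectedOutput)),
       s"job did not produce $expectedOutput after $maxAttempts attempts")
   }
   ```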
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


