tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r746690053



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       So I guess I'm wondering whether we should make any change here at all.
   This doesn't fail the application; it may be unwanted behavior in your case 
that it sometimes retries, but if it's doing it too often it seems like a 
cluster issue to me.  The Hadoop client has built-in retries and timeouts that 
are supposed to be configured to account for some of these, so if you are going 
outside of those, that should be an out-of-the-norm occurrence.  The Hadoop 
philosophy, IMHO, is to retry on failures, so if this occasionally happens it 
shouldn't be a big deal.  If it is, it's up to the application to handle.  The 
Hadoop API for properly checking application status is what we are already 
using, whereas this change tries to infer the status, which could be brittle 
and just adds maintenance for Spark if we add yet another config.
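
   To make the brittleness concern concrete, here is a minimal standalone 
sketch (not Spark code). It assumes the diagnostics arrive as an 
`Option[String]` and reuses the two substrings from the diff above. The object 
name, the method name, and both sample messages are hypothetical and made up 
purely for illustration.

```scala
// Hypothetical stand-in for the substring check in the diff; not part of Spark.
object DiagnosticsInference {

  // Same test as the diff: both substrings must appear in the diagnostics text.
  def amContainerExitedCleanly(diags: Option[String]): Boolean =
    diags.exists(d => d.contains("AM Container") && d.contains("exitCode: 0"))

  def main(args: Array[String]): Unit = {
    // Made-up diagnostics text that happens to match the expected wording.
    val matching = Some("AM Container for attempt exited with exitCode: 0")
    // Made-up text describing the same outcome with slightly different wording.
    val reworded = Some("AM container for attempt exited with exit code 0")

    println(amContainerExitedCleanly(matching)) // true
    println(amContainerExitedCleanly(reworded)) // false
    println(amContainerExitedCleanly(None))     // false
  }
}
```

   The second message reports the same outcome but is no longer recognized, 
which is the kind of silent breakage a string match on free-form diagnostics 
invites.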
   
   Note that Hadoop also has a shutdown hook timeout configuration, which I 
think is the 30 seconds you are referring to here: 
`hadoop.service.shutdown.timeout`.  You could increase that, along with the 
Hadoop retries/timeouts.
   
   It would be nice to get more feedback on whether this is a problem for 
other users, or whether other devs have opinions.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


