Ryan Williams created SPARK-6449:
------------------------------------

             Summary: Driver OOM results in reported application result SUCCESS
                 Key: SPARK-6449
                 URL: https://issues.apache.org/jira/browse/SPARK-6449
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.3.0
            Reporter: Ryan Williams


I ran a job yesterday that, according to the History Server and the YARN RM, 
finished with status {{SUCCESS}}.

Clicking around the History Server UI, I saw that fewer stages had run than I 
expected, and I couldn't figure out why.

Finally, inspecting the end of the driver's logs, I saw:
{code}
15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "Driver" scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.)
15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.)
15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1426705269584_0055
{code}

The driver hit an {{OutOfMemoryError}}, [the {{catch}} block that presumably should have caught it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484] 
threw a {{MatchError}} instead of handling it, and {{SUCCESS}} was then 
returned to YARN and written to the event log.

This should be logged as a failed job and reported as such to YARN.
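
To illustrate the failure mode, here is a minimal standalone sketch. It is not the actual {{ApplicationMaster}} code. It assumes the {{catch}} block pattern-matches on the exception's cause with cases that only cover {{Exception}} subtypes, which would be consistent with the {{MatchError}} in the log above, since {{OutOfMemoryError}} is an {{Error}} rather than an {{Exception}}. The names {{CauseMatchSketch}}, {{classifyNarrow}}, {{classifyWide}}, and the {{Status}} type are all hypothetical.

```scala
import scala.util.Try

object CauseMatchSketch {
  // Hypothetical stand-in for the final status reported to YARN.
  sealed trait Status
  case object Interrupted extends Status
  case object Failed extends Status

  // Assumed shape of the problematic match: only Exception subtypes are
  // covered, so a java.lang.Error such as OutOfMemoryError falls through
  // and Scala throws scala.MatchError at runtime.
  def classifyNarrow(cause: Throwable): Status = cause match {
    case _: InterruptedException => Interrupted
    case _: Exception            => Failed
  }

  // One possible fix: a catch-all Throwable case, so that Errors are also
  // classified as failures instead of escaping the match.
  def classifyWide(cause: Throwable): Status = cause match {
    case _: InterruptedException => Interrupted
    case _: Throwable            => Failed
  }

  def main(args: Array[String]): Unit = {
    val oom = new OutOfMemoryError("GC overhead limit exceeded")

    // MatchError is a RuntimeException, so Try captures it.
    val narrow = Try(classifyNarrow(oom))
    println(s"narrow match threw MatchError: ${narrow.failed.toOption.exists(_.isInstanceOf[MatchError])}")

    println(s"wide match result: ${classifyWide(oom)}")
  }
}
```

With the narrow match, no failure status is ever recorded, which would leave the shutdown hook free to report its default status. The wide match is only one way to close that gap; whether an OOM'd driver can reliably report anything at all is a separate question.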



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
