Elkhan Dadashov created SPARK-9416:
--------------------------------------

             Summary: Yarn logs say that Spark Python job has succeeded even 
though job has failed in Yarn cluster mode
                 Key: SPARK-9416
                 URL: https://issues.apache.org/jira/browse/SPARK-9416
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.4.1
         Environment: 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 
2015 x86_64 x86_64 x86_64 GNU/Linux
            Reporter: Elkhan Dadashov


While running the Spark word count Python example with an intentional mistake in 
YARN cluster mode, the terminal output (YARN client logs) reports the final 
status as SUCCEEDED, but the application log files correctly indicate that the 
job failed.

Terminal log output & application log output contradict each other.

If I run the same job in local mode, the terminal logs and application logs 
match: both state that the job failed due to the expected error in the Python 
script.

Scenario in more detail:

While running the Spark word count Python example in YARN cluster mode, I make 
an intentional error in wordcount.py by changing this line (I am using Spark 
1.4.1, but the problem also exists in Spark 1.4.0 and 1.3.0, which I tested):

lines = sc.textFile(sys.argv[1], 1)

into this line:

lines = sc.textFile(nonExistentVariable,1)

where the variable nonExistentVariable is never created or initialized.
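
For reference, here is a minimal sketch of the modified script; the surrounding 
code paraphrases the bundled wordcount.py example and may differ slightly from 
the shipped file:

import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    # Original line: lines = sc.textFile(sys.argv[1], 1)
    # Intentional bug: nonExistentVariable is never defined, so this line
    # raises NameError in the driver before any Spark job runs.
    lines = sc.textFile(nonExistentVariable, 1)
    counts = (lines.flatMap(lambda x: x.split(' '))
                   .map(lambda x: (x, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print("%s: %i" % (word, count))
    sc.stop()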

Then I run the example with this command (I put README.md into HDFS 
beforehand):

./bin/spark-submit --master yarn-cluster wordcount.py /README.md
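
The mismatch can also be observed programmatically by capturing the client's 
exit code (a small sketch; given the SUCCEEDED final status below, spark-submit 
presumably exits 0 here even though the driver script crashed):

import subprocess

# Launch the broken example in yarn-cluster mode and capture the client's
# exit code (assumes this is run from the Spark installation directory).
ret = subprocess.call([
    "./bin/spark-submit", "--master", "yarn-cluster",
    "wordcount.py", "/README.md",
])
# Expected: nonzero, because the driver raised NameError.
# Observed with this bug: presumably 0, matching the SUCCEEDED status.
print("spark-submit exit code: %d" % ret)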

According to the log printed in the terminal, the job runs and finishes 
successfully.
Terminal logs:
...
15/07/23 16:19:17 INFO yarn.Client: Application report for 
application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:18 INFO yarn.Client: Application report for 
application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:19 INFO yarn.Client: Application report for 
application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:20 INFO yarn.Client: Application report for 
application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:21 INFO yarn.Client: Application report for 
application_1437612288327_0013 (state: FINISHED)
15/07/23 16:19:21 INFO yarn.Client: 
         client token: N/A
         diagnostics: Shutdown hook called before final status was reported.
         ApplicationMaster host: 10.0.53.59
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1437693551439
         final status: SUCCEEDED
         tracking URL: 
http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
         user: edadashov
15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
15/07/23 16:19:21 INFO util.Utils: Deleting directory 
/tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444

But the log files generated for this application in HDFS indicate that the job 
failed, with the correct reason:
Application log files:
...
stdout:
Traceback (most recent call last):
  File "wordcount.py", line 32, in <module>
    lines = sc.textFile(nonExistentVariable,1)
NameError: name 'nonExistentVariable' is not defined


The terminal output (the YARN client log) reports final status: SUCCEEDED, 
contradicting the application logs, which show that the job failed (NameError: 
name 'nonExistentVariable' is not defined). The diagnostics line above 
("Shutdown hook called before final status was reported.") suggests the 
ApplicationMaster shut down before reporting the real final status, after which 
YARN defaulted to SUCCEEDED.
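
One way to confirm what YARN itself recorded is to query the ResourceManager 
REST API for the application (a sketch, assuming a Hadoop 2.x ResourceManager 
at localhost:8088 as in the tracking URL above):

import json
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

app_id = "application_1437612288327_0013"
url = "http://localhost:8088/ws/v1/cluster/apps/%s" % app_id
report = json.loads(urlopen(url).read().decode("utf-8"))["app"]
# Expected: FAILED; observed with this bug: SUCCEEDED.
print(report["finalStatus"])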



