[ https://issues.apache.org/jira/browse/SPARK-9416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646365#comment-14646365 ]
Apache Spark commented on SPARK-9416:
-------------------------------------

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7751

> Yarn logs say that Spark Python job has succeeded even though job has failed in Yarn cluster mode
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-9416
>                 URL: https://issues.apache.org/jira/browse/SPARK-9416
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.1
>         Environment: 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Elkhan Dadashov
>
> While running the Spark word count Python example with an intentional mistake in Yarn cluster mode, the Spark terminal logs (Yarn logs) report the final status as SUCCEEDED, but the log files for the Spark application show the correct result, indicating that the job failed.
> The terminal log output and the application log output contradict each other.
> If I run the same job in local mode, the terminal logs and application logs match: both state that the job failed with the expected error in the Python script.
> More details: Scenario
> While running the Spark word count Python example in Yarn cluster mode, I make an intentional error in wordcount.py by changing this line (I'm using Spark 1.4.1, but this problem also exists in Spark 1.4.0 and 1.3.0, which I tested):
> lines = sc.textFile(sys.argv[1], 1)
> into this line:
> lines = sc.textFile(nonExistentVariable, 1)
> where the nonExistentVariable variable was never created or initialized.
> Then I run that example with this command (I put README.md into HDFS before running it):
> ./bin/spark-submit --master yarn-cluster wordcount.py /README.md
> The job runs and finishes successfully according to the log printed in the terminal:
> Terminal logs:
> ...
> 15/07/23 16:19:17 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
> 15/07/23 16:19:18 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
> 15/07/23 16:19:19 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
> 15/07/23 16:19:20 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
> 15/07/23 16:19:21 INFO yarn.Client: Application report for application_1437612288327_0013 (state: FINISHED)
> 15/07/23 16:19:21 INFO yarn.Client:
>      client token: N/A
>      diagnostics: Shutdown hook called before final status was reported.
>      ApplicationMaster host: 10.0.53.59
>      ApplicationMaster RPC port: 0
>      queue: default
>      start time: 1437693551439
>      final status: SUCCEEDED
>      tracking URL: http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
>      user: edadashov
> 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
> 15/07/23 16:19:21 INFO util.Utils: Deleting directory /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
> But if I look at the log files generated for this application in HDFS, they indicate the failure of the job with the correct reason:
> Application log files:
> ...
> stdout:
> Traceback (most recent call last):
>   File "wordcount.py", line 32, in <module>
>     lines = sc.textFile(nonExistentVariable,1)
> NameError: name 'nonExistentVariable' is not defined
> The terminal output (Yarn logs) - final status: SUCCEEDED - does not match the application log results, which show the failure of the job (NameError: name 'nonExistentVariable' is not defined).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
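For illustration, the failure mode the reporter describes can be reproduced without a Spark installation: referencing a name that was never defined raises NameError at runtime, and a driver process that exits non-zero on such an uncaught error is what allows the resource manager to report a failed final status rather than SUCCEEDED. This is a minimal, Spark-free sketch of that behavior (the `main` function here is a hypothetical stand-in for the modified wordcount.py, not code from the issue):

```python
import sys


def main():
    # Stand-in for the reporter's intentional mistake in wordcount.py:
    # 'nonExistentVariable' is never created or initialized, so this
    # line raises NameError when executed.
    lines = nonExistentVariable  # noqa: F821 -- undefined by design
    return lines


if __name__ == "__main__":
    try:
        main()
    except NameError as exc:
        # Same message as in the application logs:
        # name 'nonExistentVariable' is not defined
        print(exc, file=sys.stderr)
        # Exiting with a non-zero status is what should let the cluster
        # report the job as failed instead of succeeded.
        sys.exit(1)
```

The bug report is precisely that, in yarn-cluster mode, this Python-side failure was not reflected in the final application status shown by yarn.Client.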