[
https://issues.apache.org/jira/browse/SPARK-9416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646365#comment-14646365
]
Apache Spark commented on SPARK-9416:
-------------------------------------
User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7751
> Yarn logs say that Spark Python job has succeeded even though job has failed
> in Yarn cluster mode
> -------------------------------------------------------------------------------------------------
>
> Key: SPARK-9416
> URL: https://issues.apache.org/jira/browse/SPARK-9416
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.4.1
> Environment: 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC
> 2015 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: Elkhan Dadashov
>
> While running the Spark word count Python example with an intentional
> mistake in Yarn cluster mode, the Spark terminal logs (Yarn logs) state the
> final status as SUCCEEDED, but the log files for the Spark application show
> the correct result, indicating that the job failed.
> The terminal log output and the application log output contradict each other.
> If I run the same job in local mode, the terminal logs and application logs
> match: both state that the job failed due to the expected error in the
> Python script.
> More details: Scenario
> While running the Spark word count Python example in Yarn cluster mode, I
> make an intentional error in wordcount.py by changing this line (I'm using
> Spark 1.4.1, but this problem also exists in Spark 1.4.0 and 1.3.0, which I
> tested):
> lines = sc.textFile(sys.argv[1], 1)
> into this line:
> lines = sc.textFile(nonExistentVariable,1)
> where the nonExistentVariable variable was never created or initialized.
> Then I run that example with this command (I put README.md into HDFS before
> running it):
> ./bin/spark-submit --master yarn-cluster wordcount.py /README.md
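For context, the failure mode here is an ordinary Python NameError in the driver script. A minimal sketch (no PySpark required; the function name is made up for illustration) that reproduces the same error:

```python
# Sketch only: referencing a name that was never defined raises NameError
# at the driver, which should cause the whole job to be reported as failed.

def broken_driver():
    # Mirrors the modified wordcount.py line:
    #   lines = sc.textFile(nonExistentVariable, 1)
    lines = nonExistentVariable  # never created or initialized
    return lines

try:
    broken_driver()
except NameError as e:
    err = str(e)

print(err)  # message names the undefined variable
```

An uncaught exception like this makes the Python interpreter exit with a nonzero status, which is the signal the YARN-side status reporting should be picking up.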
> The job runs and finishes successfully according to the log printed in the
> terminal:
> Terminal logs:
> ...
> 15/07/23 16:19:17 INFO yarn.Client: Application report for
> application_1437612288327_0013 (state: RUNNING)
> 15/07/23 16:19:18 INFO yarn.Client: Application report for
> application_1437612288327_0013 (state: RUNNING)
> 15/07/23 16:19:19 INFO yarn.Client: Application report for
> application_1437612288327_0013 (state: RUNNING)
> 15/07/23 16:19:20 INFO yarn.Client: Application report for
> application_1437612288327_0013 (state: RUNNING)
> 15/07/23 16:19:21 INFO yarn.Client: Application report for
> application_1437612288327_0013 (state: FINISHED)
> 15/07/23 16:19:21 INFO yarn.Client:
> client token: N/A
> diagnostics: Shutdown hook called before final status was reported.
> ApplicationMaster host: 10.0.53.59
> ApplicationMaster RPC port: 0
> queue: default
> start time: 1437693551439
> final status: SUCCEEDED
> tracking URL:
> http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
> user: edadashov
> 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
> 15/07/23 16:19:21 INFO util.Utils: Deleting directory
> /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
> But if I look at the log files generated for this application in HDFS, they
> indicate failure of the job with the correct reason:
> Application log files:
> ...
> stdout:
> Traceback (most recent call last):
> File "wordcount.py", line 32, in <module>
> lines = sc.textFile(nonExistentVariable,1)
> NameError: name 'nonExistentVariable' is not defined
> The terminal output (Yarn logs), final status: SUCCEEDED, does not match the
> application log result: failure of the job (NameError: name
> 'nonExistentVariable' is not defined).
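The mismatch suggests the final status is being reported without consulting the Python driver's exit code. A minimal sketch of the expected behavior, assuming a launcher that runs the driver as a subprocess (the command and function names below are hypothetical stand-ins, not Spark's actual internals):

```python
import subprocess
import sys

def run_driver(cmd):
    """Run a Python driver as a subprocess and return its exit code.

    A launcher deciding the YARN final status should map a nonzero
    exit code (an uncaught NameError exits with 1) to FAILED, never
    to SUCCEEDED.
    """
    return subprocess.run(cmd).returncode

# A driver that dies with NameError, like the modified wordcount.py:
exit_code = run_driver([sys.executable, "-c", "lines = nonExistentVariable"])
final_status = "SUCCEEDED" if exit_code == 0 else "FAILED"
print(exit_code, final_status)
```

Under this assumption, the bug reported here corresponds to the launcher ignoring `exit_code` and reporting SUCCEEDED unconditionally.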
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]