[
https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849742#comment-16849742
]
Atul Anand commented on SPARK-13182:
------------------------------------
# YARN's policy is to preempt a job in a low-priority queue in favour of a job
in a higher-priority queue, and it is doing exactly that. So IMHO nothing is
wrong with the YARN policy.
# YARN users (such as Spark and MapReduce) decide what to do after a preemption.
If Spark keeps relaunching containers indefinitely, the preemption is not
actually handled.
# This behaviour makes the YARN queue priority set via "spark.yarn.queue" irrelevant.
[~mccheah]'s
[commit|https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-bad3987c83bd22d46416d3dd9d208e76R730]
made the optimisation to ignore non-application failures.
IMHO we should add a separate counter to limit retries due to non-application
errors, something like externalFailuresRetries, infinite by default.
Users who expect external failures to be preemptions only can set it to 1 or 2.
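To make the proposal concrete, here is a minimal, language-agnostic sketch of the idea (this is not Spark code; the class and parameter names, including externalFailuresRetries, are hypothetical): application failures and external failures (e.g. preemptions) are counted separately, each against its own limit.

```python
import math


class ExecutorRetryTracker:
    """Hypothetical sketch: count executor failures by cause and decide
    whether to relaunch. `max_app_failures` mirrors the existing
    application-failure limit; `max_external_failures` is the proposed
    externalFailuresRetries counter, infinite by default so current
    behaviour (always relaunch after preemption) is preserved."""

    def __init__(self, max_app_failures=3, max_external_failures=math.inf):
        self.max_app_failures = max_app_failures
        self.max_external_failures = max_external_failures
        self.app_failures = 0
        self.external_failures = 0

    def on_executor_exit(self, caused_by_app):
        """Record one failure; return True if the executor may be relaunched."""
        if caused_by_app:
            self.app_failures += 1
            return self.app_failures <= self.max_app_failures
        self.external_failures += 1
        return self.external_failures <= self.max_external_failures
```

With the default of infinity, external failures never exhaust the budget; a user who knows their external failures are preemptions can set the limit to 1 or 2 and the relaunch loop stops there.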
> Spark Executor retries infinitely
> ---------------------------------
>
> Key: SPARK-13182
> URL: https://issues.apache.org/jira/browse/SPARK-13182
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2
> Reporter: Prabhu Joseph
> Priority: Minor
>
> When a Spark job (Spark 1.5.2) is submitted with a single executor and the
> user passes wrong JVM arguments via spark.executor.extraJavaOptions,
> the first executor fails. But the job keeps retrying, creating a new
> executor and failing every time, until CTRL-C is pressed.
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" --conf
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16"
> /SPARK/SimpleApp.jar
> Here the user submits the job with ConcGCThreads=16, which is greater than
> ParallelGCThreads, so the JVM crashes. But the job does not exit; it keeps
> creating executors and retrying.
> ..........
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores,
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
> app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
> app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
> app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor
> app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added:
> app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores,
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
> app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
> app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor
> app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added:
> app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID
> app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores,
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
> app-20160201065319-0014/2848 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
> app-20160201065319-0014/2848 is now RUNNING
> Spark should not fall into a trap on these kinds of user errors on a
> production cluster.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)