[ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849742#comment-16849742 ]
Atul Anand edited comment on SPARK-13182 at 5/28/19 1:55 PM:
-------------------------------------------------------------

# YARN's policy is to preempt a job in a low-priority queue in favour of a job in a higher-priority queue. It is doing exactly that, so IMHO there is nothing wrong with the YARN policy itself.
# YARN applications (like Spark and MapReduce) decide what to do after a preemption, whatever its cause. If Spark keeps relaunching containers indefinitely, the preemption is not actually handled.
# This behaviour makes the YARN queue passed via "spark.yarn.queue" irrelevant.

[~mccheah]'s commit [https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-bad3987c83bd22d46416d3dd9d208e76R730] made the optimisation of ignoring non-application failures when counting executor failures. IMHO we should have an additional counter to limit retries due to non-application errors, something like externalFailuresRetries, Inf by default. Users who expect external failures to be preemptions only could set it to 1 or 2. (A hedged sketch of such a counter follows the quoted issue below.)

> Spark Executor retries infinitely
> ---------------------------------
>
>                 Key: SPARK-13182
>                 URL: https://issues.apache.org/jira/browse/SPARK-13182
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Prabhu Joseph
>            Priority: Minor
>
> When a Spark job (Spark-1.5.2) is submitted with a single executor and the user passes some wrong JVM arguments via spark.executor.extraJavaOptions, the first executor fails. But the job keeps retrying, creating a new executor and failing every time, until CTRL-C is pressed.
>
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" /SPARK/SimpleApp.jar
>
> Here the user submits the job with ConcGCThreads=16, which is greater than ParallelGCThreads, so the JVM fails to start and the executor command exits with code 1. But the job does not exit; it keeps creating executors and retrying.
>
> ..........
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now RUNNING
>
> Spark should not fall into a trap on this kind of user error on a production cluster.
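To make the proposal above concrete, here is a minimal sketch of what such a counter could look like. Everything in it is hypothetical and for illustration only: the class {{ExecutorFailureTracker}}, its parameter names, and a config key like spark.yarn.externalFailuresRetries are not part of Spark's codebase. The sketch only demonstrates the idea of keeping application-caused failures under the existing max-failures budget while externally caused failures (preemption, node loss) get an independent budget that is unlimited by default.

{code:scala}
import java.util.concurrent.atomic.AtomicInteger

/**
 * Hypothetical sketch (not Spark API): tracks executor exits in two
 * buckets. Application-caused failures count against the existing
 * max-failures limit; external failures (e.g. YARN preemption, node
 * loss) count against a separate limit that is unlimited by default.
 */
class ExecutorFailureTracker(
    maxAppFailures: Int,
    // "Inf" by default: preserves today's behaviour of never giving up
    // on failures that are not the application's fault.
    maxExternalFailures: Int = Int.MaxValue) {

  private val appFailures = new AtomicInteger(0)
  private val externalFailures = new AtomicInteger(0)

  /** Record one executor exit; `causedByApp` stands in for whatever
   *  check currently decides if a failure counts against the app. */
  def recordFailure(causedByApp: Boolean): Unit =
    if (causedByApp) appFailures.incrementAndGet()
    else externalFailures.incrementAndGet()

  /** The application should stop relaunching once either budget is spent. */
  def shouldAbort: Boolean =
    appFailures.get() >= maxAppFailures ||
      externalFailures.get() >= maxExternalFailures
}

object ExecutorFailureTrackerDemo extends App {
  // A user who expects external failures to be preemptions only
  // could set the external budget to 1 or 2, as suggested above.
  val tracker = new ExecutorFailureTracker(maxAppFailures = 3, maxExternalFailures = 2)

  tracker.recordFailure(causedByApp = false) // e.g. container preempted
  println(tracker.shouldAbort)               // false: one external failure tolerated
  tracker.recordFailure(causedByApp = false) // preempted again
  println(tracker.shouldAbort)               // true: external budget exhausted
}
{code}

With the default budget of Int.MaxValue this degenerates to the behaviour after [~mccheah]'s change, so only users who opt in would ever see a bound on externally caused relaunches.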