Spark Executor retries infinitely
Hi All,

When a Spark job (Spark 1.5.2) is submitted with a single executor and the user passes invalid JVM arguments in spark.executor.extraJavaOptions, the first executor fails. But the job keeps retrying, creating a new executor and failing every time, until CTRL-C is pressed. Is there a configuration to limit the retry attempts?

Example:

./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" \
  /SPARK/SimpleApp.jar

Executor fails with:

Error occurred during initialization of VM
Can't have more ConcGCThreads than ParallelGCThreads.

But the job does not exit; it keeps creating executors and retrying.

..
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now RUNNING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2846 removed: Command exited with code 1
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2846
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2847 removed: Command exited with code 1
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2847
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now RUNNING

Thanks,
Prabhu Joseph
Re: Spark Executor retries infinitely
Thanks Ted.

My concern is how to avoid this kind of user error on a production cluster. It would be better if Spark handled this itself, instead of creating a new executor every second that immediately fails, which overloads the Spark Master. Shall I file a Spark JIRA for this?

Thanks,
Prabhu Joseph

On Mon, Feb 1, 2016 at 9:09 PM, Ted Yu wrote:
> I haven't found a config knob for controlling the retry count after a
> brief search.
>
> According to
> http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html ,
> the default value for -XX:ParallelGCThreads= seems to be 8.
> This seems to explain why you got the VM initialization error.
>
> FYI
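
One possible guard against this kind of mistake, sketched here as an assumption rather than anything Spark itself provides, is to start a throwaway JVM with the same options before submitting and check that it initializes. The function name validate_jvm_opts and the JAVA_BIN variable are hypothetical. The check is only indicative: it runs on the submitting host, so it is meaningful only if that host's JVM version and hardware resemble the workers', since some JVM defaults depend on the machine.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of Spark): launch a short-lived JVM
# with the candidate options and report whether it initializes.
# JAVA_BIN is an assumed override for the java executable; defaults to "java".
validate_jvm_opts() {
    opts="$1"
    java_bin="${JAVA_BIN:-java}"
    # $opts is left unquoted on purpose so each flag becomes its own argument.
    if "$java_bin" $opts -version >/dev/null 2>&1; then
        echo "JVM options accepted: $opts"
        return 0
    fi
    echo "JVM options rejected: $opts" >&2
    return 1
}
```

A wrapper script could call validate_jvm_opts with the value intended for spark.executor.extraJavaOptions and run spark-submit only when it returns 0.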
Re: Spark Executor retries infinitely
I haven't found a config knob for controlling the retry count after a brief search.

According to http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html , the default value for -XX:ParallelGCThreads= seems to be 8. The job sets -XX:ConcGCThreads=16, which is above that, so this seems to explain why you got the VM initialization error.

FYI

On Mon, Feb 1, 2016 at 4:16 AM, Prabhu Joseph wrote:
> Hi All,
>
> When a Spark job (Spark-1.5.2) is submitted with a single executor and
> if user passes some wrong JVM arguments with
> spark.executor.extraJavaOptions, the first executor fails. But the job
> keeps on retrying, creating a new executor and failing every time, until
> CTRL-C is pressed. Do we have configuration to limit the retry attempts.
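
Following the explanation above, the VM initialization error should go away once ConcGCThreads no longer exceeds ParallelGCThreads. A minimal sketch of the same submission with both values set explicitly is shown below; the thread counts 12 and 3 are illustrative assumptions, not tuned recommendations, and this does not change the retry behaviour itself.

```shell
# Same command as in the original post, with ParallelGCThreads set explicitly
# and ConcGCThreads kept below it. The values 12 and 3 are assumptions chosen
# only to satisfy ConcGCThreads <= ParallelGCThreads.
./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077" \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ParallelGCThreads=12 -XX:ConcGCThreads=3" \
  /SPARK/SimpleApp.jar
```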