Thanks Sai for the input. So the problem is: I start my job with a fixed number of executors, but when a host running my executors goes unreachable, the driver reduces the total number of executors, and it never increases it again.
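For context, the job runs in yarn-client mode with a fixed executor count and dynamic allocation left disabled. A rough sketch of the driver-side configuration (the app name and exact values are only illustrative; in practice these are passed as spark-submit flags):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: a fixed-size job with dynamic allocation explicitly off.
    val conf = new SparkConf()
      .setAppName("executor-loss-repro")                 // placeholder name
      .set("spark.executor.instances", "2")              // same as --num-executors 2
      .set("spark.executor.cores", "1")
      .set("spark.executor.memory", "1g")                // 1024 MB + 384 MB overhead = 1408 MB in the logs below
      .set("spark.dynamicAllocation.enabled", "false")   // off (this is also the default)

    val sc = new SparkContext(conf)                      // master set to yarn-client via spark-submit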
I have a repro for the issue; attaching logs.

!!!! Running Spark job is configured for 2 executors, dynamic allocation not enabled !!!!

AM starts by requesting the 2 executors:

15/10/19 12:25:58 INFO yarn.YarnRMClient: Registering the ApplicationMaster
15/10/19 12:25:59 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/10/19 12:25:59 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000

Executors launched:

15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : DN-2:58739
15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : DN-1:44591
15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container container_1444841612643_0014_01_000002 for on host DN-2
15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, executorHostname: DN-2
15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container container_1444841612643_0014_01_000003 for on host DN-1
15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, executorHostname: DN-1

Now my AM and executor 1 are running on DN-2, and DN-1 has executor 2 running on it. To reproduce the issue I removed the IP from DN-1 until it was timed out by Spark:

15/10/19 13:03:30 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s).
15/10/19 13:03:30 INFO yarn.ApplicationMaster: Driver requested to kill executor(s) 2.

So the driver has reduced the total number of executors to 1, and even when the DN comes back up and rejoins the cluster, this count is never increased. If executor 1 had been running on a separate DN (not the same one as the AM) and that DN went unreachable, the driver would reduce the total number of executors to 0 and the job would hang forever. And this is with dynamic allocation not enabled. My cluster has other DNs available; the AM should request replacements for the killed executors from YARN and get them on some other DNs.
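To make the driver-side behaviour concrete, here is a condensed, paraphrased sketch of the 1.3.1 code path that produces the "Driver requested a total number of 1 executor(s)" line above. The names roughly follow the real CoarseGrainedSchedulerBackend, but the class and bodies are simplified and this is not the exact source (the full analysis is in my original mail quoted below):

    import scala.collection.mutable

    // Condensed paraphrase of the 1.3.1 driver-side kill path; not the actual Spark source.
    class SimplifiedSchedulerBackend {
      private val executorDataMap = mutable.HashMap[String, AnyRef]()   // known live executors
      private val executorsPendingToRemove = mutable.HashSet[String]()
      private var numPendingExecutors = 0

      private def numExistingExecutors: Int = executorDataMap.size

      // Stand-ins for the YARN-specific implementations that talk to the AM.
      protected def doRequestTotalExecutors(requestedTotal: Int): Boolean = true
      protected def doKillExecutors(executorIds: Seq[String]): Boolean = true

      // Reached (indirectly) from HeartbeatReceiver.expireDeadHosts() via
      // sc.killExecutor() when an executor's heartbeats time out.
      def killExecutors(executorIds: Seq[String]): Boolean = synchronized {
        val filteredExecutorIds = executorIds.filter(executorDataMap.contains)

        // The requested total is lowered to "existing + pending - to-remove - killed",
        // so the AM will not ask YARN for replacement containers...
        val newTotal = numExistingExecutors + numPendingExecutors -
          executorsPendingToRemove.size - filteredExecutorIds.size
        doRequestTotalExecutors(newTotal)

        // ...and only then are the lost executors' containers actually killed.
        executorsPendingToRemove ++= filteredExecutorIds
        doKillExecutors(filteredExecutorIds)
      }
    }

Because the new total is derived from the current executor count minus the ones being killed, the target only ever goes down on a heartbeat timeout; nothing raises it back when the node rejoins.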
Regards,
Prakhar

On Mon, Oct 19, 2015 at 2:47 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> This is a deliberate kill request from the heartbeat mechanism; it has nothing
> to do with dynamic allocation. Because you're running in yarn mode,
> "supportDynamicAllocation" will be true, but there is actually no relation
> to dynamic allocation.
>
> From my understanding, "doRequestTotalExecutors" is there to sync the current
> total executor number with the AM; the AM will try to cancel some pending
> container requests when the current expected executor number is lower. The
> actual container killing command is issued by "doRequestTotalExecutors".
>
> Not sure what your actual problem is. Is it unexpected?
>
> Thanks
> Saisai
>
> On Mon, Oct 19, 2015 at 3:51 PM, prakhar jauhari <prak...@gmail.com> wrote:
>
>> Hey all,
>>
>> Thanks in advance. I ran into a situation where the Spark driver reduced the
>> total executor count for my job even with dynamic allocation disabled, and
>> caused the job to hang forever.
>>
>> Setup:
>> Spark-1.3.1 on a hadoop-yarn-2.4.0 cluster.
>> All servers in the cluster running Linux version 2.6.32.
>> Job in yarn-client mode.
>>
>> Scenario:
>> 1. Application running with the required number of executors.
>> 2. One of the DNs loses connectivity and is timed out.
>> 3. Spark issues a killExecutor for the executor on the DN which was timed out.
>> 4. Even with dynamic allocation off, Spark's driver reduces the
>>    "targetNumExecutors".
>>
>> On analysing the code (Spark 1.3.1):
>>
>> When my DN goes unreachable:
>>
>> Spark core's HeartbeatReceiver invokes expireDeadHosts(), which checks whether
>> dynamic allocation is supported and then invokes "sc.killExecutor()":
>>
>>     if (sc.supportDynamicAllocation) {
>>       sc.killExecutor(executorId)
>>     }
>>
>> Surprisingly, supportDynamicAllocation in SparkContext.scala is defined to be
>> true if the dynamicAllocationTesting flag is enabled or Spark is running over
>> "yarn":
>>
>>     private[spark] def supportDynamicAllocation =
>>       master.contains("yarn") || dynamicAllocationTesting
>>
>> "sc.killExecutor()" matches it to the configured "schedulerBackend"
>> (CoarseGrainedSchedulerBackend in this case) and invokes
>> "killExecutors(executorIds)".
>>
>> CoarseGrainedSchedulerBackend calculates a "newTotal" for the total number
>> of executors required, and sends an update to the application master by
>> invoking "doRequestTotalExecutors(newTotal)".
>>
>> CoarseGrainedSchedulerBackend then invokes
>> "doKillExecutors(filteredExecutorIds)" for the lost executors.
>>
>> Thus the total number of executors is reduced whenever a host is
>> intermittently unreachable.
>>
>> I noticed that this change to "CoarseGrainedSchedulerBackend" was introduced
>> while fixing: https://issues.apache.org/jira/browse/SPARK-6325
>>
>> I am new to this code, so it would be a great help if any of you could comment
>> on why we need "doRequestTotalExecutors" in "killExecutors". Also, why is
>> "supportDynamicAllocation" true even when I have not enabled dynamic
>> allocation?
>>
>> Regards,
>> Prakhar.
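For reference, the AM-side behaviour Saisai describes above (cancelling pending container requests when the requested total drops) can be pictured with the following simplified sketch. The names here are illustrative and do not match the real YarnAllocator source:

    import scala.collection.mutable

    // Illustrative sketch only (hypothetical names): why a lowered target is never
    // raised again once the timed-out executor has been killed.
    class SimplifiedAllocator(initialTarget: Int) {
      private var targetNumExecutors = initialTarget        // starts at --num-executors
      private var numRunningExecutors = initialTarget
      private val pendingRequests = mutable.Queue[String]()  // outstanding container requests

      // Driver -> AM: doRequestTotalExecutors(newTotal) lands here.
      def requestTotalExecutors(requestedTotal: Int): Unit = synchronized {
        targetNumExecutors = requestedTotal
      }

      // AM -> ResourceManager heartbeat: reconcile requests with the current target.
      def updateResourceRequests(): Unit = synchronized {
        val missing = targetNumExecutors - numRunningExecutors - pendingRequests.size
        if (missing > 0) {
          // ask the ResourceManager for `missing` more containers
        } else if (missing < 0) {
          // cancel up to `-missing` pending container requests
        }
        // After the timed-out executor is killed, targetNumExecutors and
        // numRunningExecutors drop together (2 -> 1), so `missing` stays 0 and no
        // replacement container is ever requested, even when the DN rejoins.
      }

      // Called when a container/executor is actually killed or lost.
      def executorKilled(): Unit = synchronized { numRunningExecutors -= 1 }
    }

Under this model, the executor count only recovers if something calls requestTotalExecutors with a higher number again, which nothing does when dynamic allocation is off.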