Hi Prakhar, I now understand your problem: you expected the executor killed by the heartbeat mechanism to be launched again, but it is not. I think this problem is fixed in Spark 1.5; you could check this JIRA: https://issues.apache.org/jira/browse/SPARK-8119
Thanks
Saisai

On Tuesday, October 20, 2015, prakhar jauhari <prak...@gmail.com> wrote:

> Thanks Sai for the input.
>
> So the problem is: I start my job with some fixed number of executors,
> but when a host running my executors goes unreachable, the driver reduces
> the total number of executors and never increases it again.
>
> I have a repro for the issue; attaching logs:
> !!!! The running Spark job is configured for 2 executors, dynamic
> allocation not enabled !!!!
>
> The AM starts by requesting the 2 executors:
> 15/10/19 12:25:58 INFO yarn.YarnRMClient: Registering the ApplicationMaster
> 15/10/19 12:25:59 INFO yarn.YarnAllocator: Will request 2 executor
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: <memory:1408, vCores:1>)
> 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: <memory:1408, vCores:1>)
> 15/10/19 12:25:59 INFO yarn.ApplicationMaster: Started progress reporter
> thread - sleep time : 5000
>
> Executors launched:
> 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for :
> DN-2:58739
> 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for :
> DN-1:44591
> 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container
> container_1444841612643_0014_01_000002 for on host DN-2
> 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
> driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler,
> executorHostname: DN-2
> 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container
> container_1444841612643_0014_01_000003 for on host DN-1
> 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
> driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler,
> executorHostname: DN-1
>
> Now my AM and executor 1 are running on DN-2, and DN-1 has executor 2
> running on it. To reproduce the issue I removed the IP from DN-1 until it
> was timed out by Spark:
> 15/10/19 13:03:30 INFO yarn.YarnAllocator: Driver requested a total number
> of 1 executor(s).
> 15/10/19 13:03:30 INFO yarn.ApplicationMaster: Driver requested to kill
> executor(s) 2.
>
> So the driver has reduced the total number of executors to 1, and even
> when the DN comes back up and rejoins the cluster, this count is not
> increased.
> If executor 1 had been running on a separate DN (not the same one as the
> AM) and that DN went unreachable, the driver would reduce the total number
> of executors to 0 and the job would hang forever. And this is with dynamic
> allocation not enabled. My cluster has other DNs available; the AM should
> re-request the killed executors from YARN and get them on some other DNs.
>
> Regards,
> Prakhar
>
>
> On Mon, Oct 19, 2015 at 2:47 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
>> This is a deliberate kill request made by the heartbeat mechanism; it has
>> nothing to do with dynamic allocation. Because you're running in yarn
>> mode, "supportDynamicAllocation" will be true, but there is actually no
>> relation to dynamic allocation.
>>
>> From my understanding, "doRequestTotalExecutors" is there to sync the
>> current total executor number with the AM; the AM will try to cancel some
>> pending container requests when the expected executor number is lower.
>> The actual container killing command is issued by "doKillExecutors".
>>
>> I'm not sure what your actual problem is. Is this behavior unexpected?
>>
>> Thanks
>> Saisai
>>
>>
>> On Mon, Oct 19, 2015 at 3:51 PM, prakhar jauhari <prak...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Thanks in advance. I ran into a situation where the Spark driver reduced
>>> the total executor count for my job even with dynamic allocation
>>> disabled, and caused the job to hang forever.
>>>
>>> Setup:
>>> Spark 1.3.1 on a hadoop-yarn-2.4.0 cluster.
>>> All servers in the cluster running Linux version 2.6.32.
>>> Job in yarn-client mode.
>>>
>>> Scenario:
>>> 1. Application running with the required number of executors.
>>> 2. One of the DNs loses connectivity and is timed out.
>>> 3. Spark issues a killExecutor for the executor on the DN that was timed
>>> out.
>>> 4. Even with dynamic allocation off, Spark's driver reduces the
>>> "targetNumExecutors".
>>>
>>> On analysing the code (Spark 1.3.1):
>>>
>>> When my DN goes unreachable:
>>>
>>> Spark core's HeartbeatReceiver invokes expireDeadHosts(), which checks
>>> whether dynamic allocation is supported and then invokes
>>> "sc.killExecutor()":
>>>
>>> if (sc.supportDynamicAllocation) {
>>>   sc.killExecutor(executorId)
>>> }
>>>
>>> Surprisingly, supportDynamicAllocation in SparkContext.scala is defined
>>> so that it is true if the dynamicAllocationTesting flag is enabled or
>>> Spark is running over "yarn":
>>>
>>> private[spark] def supportDynamicAllocation =
>>>   master.contains("yarn") || dynamicAllocationTesting
>>>
>>> "sc.killExecutor()" dispatches to the configured "schedulerBackend"
>>> (CoarseGrainedSchedulerBackend in this case) and invokes
>>> "killExecutors(executorIds)".
>>>
>>> CoarseGrainedSchedulerBackend calculates a "newTotal" for the total
>>> number of executors required and sends an update to the application
>>> master by invoking "doRequestTotalExecutors(newTotal)".
>>>
>>> CoarseGrainedSchedulerBackend then invokes
>>> "doKillExecutors(filteredExecutorIds)" for the lost executors.
>>>
>>> Thus the total number of executors is reduced whenever a host is
>>> intermittently unreachable.
>>>
>>> I noticed that this change to "CoarseGrainedSchedulerBackend" was
>>> introduced while fixing:
>>> https://issues.apache.org/jira/browse/SPARK-6325
>>>
>>> I am new to this code. If any of you could comment on why we need
>>> "doRequestTotalExecutors" in "killExecutors", that would be a great
>>> help. Also, why is "supportDynamicAllocation" true even when I have not
>>> enabled dynamic allocation?
>>>
>>> Regards,
>>> Prakhar.
>>
>
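To make the code path described above concrete, here is a minimal, self-contained Scala sketch of the Spark 1.3.1 behavior being discussed. The class and method names (CoarseGrainedSchedulerBackend, killExecutors, doRequestTotalExecutors, doKillExecutors) mirror the real ones, but the bodies below are simplified stand-ins written for illustration, not the actual Spark source; they only model how a heartbeat-triggered kill permanently lowers the requested executor total when dynamic allocation is off.

// Simplified sketch of the Spark 1.3.1 kill path discussed in this thread.
// Names mirror the real classes/methods, but the bodies are illustrative
// stand-ins, not the actual Spark implementation.
object KillPathSketch {

  // Stand-in for CoarseGrainedSchedulerBackend's bookkeeping on the driver.
  class SchedulerBackendSketch(var targetNumExecutors: Int) {
    private var executors = Set("1", "2") // currently registered executor IDs

    // killExecutors() first lowers the requested total, then kills containers.
    def killExecutors(executorIds: Seq[String]): Boolean = {
      val filteredExecutorIds = executorIds.filter(executors.contains)
      // With dynamic allocation disabled, nothing ever raises this back up.
      val newTotal = targetNumExecutors - filteredExecutorIds.size
      doRequestTotalExecutors(newTotal)    // sync the lower total with the AM
      doKillExecutors(filteredExecutorIds) // actually release the containers
      executors = executors -- filteredExecutorIds
      true
    }

    def doRequestTotalExecutors(requestedTotal: Int): Boolean = {
      targetNumExecutors = requestedTotal  // AM stops asking YARN beyond this
      true
    }

    def doKillExecutors(executorIds: Seq[String]): Boolean = true
  }

  def main(args: Array[String]): Unit = {
    val backend = new SchedulerBackendSketch(targetNumExecutors = 2)

    // HeartbeatReceiver.expireDeadHosts(): on YARN, supportDynamicAllocation
    // is true simply because master.contains("yarn"), so the timed-out
    // executor is killed even though dynamic allocation is disabled.
    backend.killExecutors(Seq("2"))

    // The target is now 1 and stays 1, even after the DN rejoins the cluster.
    println(s"targetNumExecutors = ${backend.targetNumExecutors}")
  }
}

With dynamic allocation enabled, the ExecutorAllocationManager can later raise the target again when tasks queue up; with it disabled, nothing does, which is why the count stays reduced and the job can hang, and which is the behavior SPARK-8119 addresses in Spark 1.5.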