Thanks Sai for the input. So the problem is: I start my job with a fixed number of executors, but when a host running my executors goes unreachable, the driver reduces the total number of executors, and it never increases it again.
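For context, the job runs in yarn-client mode with a fixed executor count and dynamic allocation left disabled. A rough sketch of the driver-side configuration (the app name and exact values are only illustrative; in practice these are passed as spark-submit flags):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: a fixed-size job with dynamic allocation explicitly off.
    val conf = new SparkConf()
      .setAppName("executor-loss-repro")                 // placeholder name
      .set("spark.executor.instances", "2")              // same as --num-executors 2
      .set("spark.executor.cores", "1")
      .set("spark.executor.memory", "1g")                // 1024 MB + 384 MB overhead = 1408 MB in the logs below
      .set("spark.dynamicAllocation.enabled", "false")   // off (this is also the default)

    val sc = new SparkContext(conf)                      // master set to yarn-client via spark-submit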
I have a repro for the issue; attaching logs.

!!!! Running Spark job is configured for 2 executors, dynamic allocation not enabled !!!!

AM starts by requesting the 2 executors:

15/10/19 12:25:58 INFO yarn.YarnRMClient: Registering the ApplicationMaster
15/10/19 12:25:59 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/10/19 12:25:59 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000

Executors launched:

15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : DN-2:58739
15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : DN-1:44591
15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container container_1444841612643_0014_01_000002 for on host DN-2
15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, executorHostname: DN-2
15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container container_1444841612643_0014_01_000003 for on host DN-1
15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, executorHostname: DN-1

Now my AM and executor 1 are running on DN-2, and DN-1 has executor 2 running on it. To reproduce the issue I removed the IP from DN-1 until it was timed out by Spark:

15/10/19 13:03:30 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s).
15/10/19 13:03:30 INFO yarn.ApplicationMaster: Driver requested to kill executor(s) 2.

So the driver has reduced the total number of executors to 1, and even when the DN comes back up and rejoins the cluster, this count is never increased. If executor 1 had been running on a separate DN (not the same one as the AM) and that DN went unreachable, the driver would reduce the total number of executors to 0 and the job would hang forever. And this is with dynamic allocation not enabled. My cluster has other DNs available; the AM should request replacements for the killed executors from YARN and get them on some other DNs.
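To make the driver-side behaviour concrete, here is a condensed, paraphrased sketch of the 1.3.1 code path that produces the "Driver requested a total number of 1 executor(s)" line above. The names roughly follow the real CoarseGrainedSchedulerBackend, but the class and bodies are simplified and this is not the exact source (the full analysis is in my original mail quoted below):

    import scala.collection.mutable

    // Condensed paraphrase of the 1.3.1 driver-side kill path; not the actual Spark source.
    class SimplifiedSchedulerBackend {
      private val executorDataMap = mutable.HashMap[String, AnyRef]()   // known live executors
      private val executorsPendingToRemove = mutable.HashSet[String]()
      private var numPendingExecutors = 0

      private def numExistingExecutors: Int = executorDataMap.size

      // Stand-ins for the YARN-specific implementations that talk to the AM.
      protected def doRequestTotalExecutors(requestedTotal: Int): Boolean = true
      protected def doKillExecutors(executorIds: Seq[String]): Boolean = true

      // Reached (indirectly) from HeartbeatReceiver.expireDeadHosts() via
      // sc.killExecutor() when an executor's heartbeats time out.
      def killExecutors(executorIds: Seq[String]): Boolean = synchronized {
        val filteredExecutorIds = executorIds.filter(executorDataMap.contains)

        // The requested total is lowered to "existing + pending - to-remove - killed",
        // so the AM will not ask YARN for replacement containers...
        val newTotal = numExistingExecutors + numPendingExecutors -
          executorsPendingToRemove.size - filteredExecutorIds.size
        doRequestTotalExecutors(newTotal)

        // ...and only then are the lost executors' containers actually killed.
        executorsPendingToRemove ++= filteredExecutorIds
        doKillExecutors(filteredExecutorIds)
      }
    }

Because the new total is derived from the current executor count minus the ones being killed, the target only ever goes down on a heartbeat timeout; nothing raises it back when the node rejoins.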
Regards,
Prakhar

On Mon, Oct 19, 2015 at 2:47 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> This is a deliberate kill request from the heartbeat mechanism; it has nothing
> to do with dynamic allocation. Because you're running in yarn mode,
> "supportDynamicAllocation" will be true, but there is actually no relation
> to dynamic allocation.
>
> From my understanding, "doRequestTotalExecutors" is there to sync the current
> total executor number with the AM; the AM will try to cancel some pending
> container requests when the current expected executor number is lower. The
> actual container killing command is issued by "doRequestTotalExecutors".
>
> Not sure what your actual problem is. Is it unexpected?
>
> Thanks
> Saisai
>
> On Mon, Oct 19, 2015 at 3:51 PM, prakhar jauhari <prak...@gmail.com> wrote:
>
>> Hey all,
>>
>> Thanks in advance. I ran into a situation where the Spark driver reduced the
>> total executor count for my job even with dynamic allocation disabled, and
>> caused the job to hang forever.
>>
>> Setup:
>> Spark-1.3.1 on a hadoop-yarn-2.4.0 cluster.
>> All servers in the cluster running Linux version 2.6.32.
>> Job in yarn-client mode.
>>
>> Scenario:
>> 1. Application running with the required number of executors.
>> 2. One of the DNs loses connectivity and is timed out.
>> 3. Spark issues a killExecutor for the executor on the DN which was timed out.
>> 4. Even with dynamic allocation off, Spark's driver reduces the
>>    "targetNumExecutors".
>>
>> On analysing the code (Spark 1.3.1):
>>
>> When my DN goes unreachable:
>>
>> Spark core's HeartbeatReceiver invokes expireDeadHosts(), which checks whether
>> dynamic allocation is supported and then invokes "sc.killExecutor()":
>>
>>     if (sc.supportDynamicAllocation) {
>>       sc.killExecutor(executorId)
>>     }
>>
>> Surprisingly, supportDynamicAllocation in SparkContext.scala is defined to be
>> true if the dynamicAllocationTesting flag is enabled or Spark is running over
>> "yarn":
>>
>>     private[spark] def supportDynamicAllocation =
>>       master.contains("yarn") || dynamicAllocationTesting
>>
>> "sc.killExecutor()" matches it to the configured "schedulerBackend"
>> (CoarseGrainedSchedulerBackend in this case) and invokes
>> "killExecutors(executorIds)".
>>
>> CoarseGrainedSchedulerBackend calculates a "newTotal" for the total number
>> of executors required, and sends an update to the application master by
>> invoking "doRequestTotalExecutors(newTotal)".
>>
>> CoarseGrainedSchedulerBackend then invokes
>> "doKillExecutors(filteredExecutorIds)" for the lost executors.
>>
>> Thus the total number of executors is reduced whenever a host is
>> intermittently unreachable.
>>
>> I noticed that this change to "CoarseGrainedSchedulerBackend" was introduced
>> while fixing: https://issues.apache.org/jira/browse/SPARK-6325
>>
>> I am new to this code, so it would be a great help if any of you could comment
>> on why we need "doRequestTotalExecutors" in "killExecutors". Also, why is
>> "supportDynamicAllocation" true even when I have not enabled dynamic
>> allocation?
>>
>> Regards,
>> Prakhar.
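For reference, the AM-side behaviour Saisai describes above (cancelling pending container requests when the requested total drops) can be pictured with the following simplified sketch. The names here are illustrative and do not match the real YarnAllocator source:

    import scala.collection.mutable

    // Illustrative sketch only (hypothetical names): why a lowered target is never
    // raised again once the timed-out executor has been killed.
    class SimplifiedAllocator(initialTarget: Int) {
      private var targetNumExecutors = initialTarget        // starts at --num-executors
      private var numRunningExecutors = initialTarget
      private val pendingRequests = mutable.Queue[String]()  // outstanding container requests

      // Driver -> AM: doRequestTotalExecutors(newTotal) lands here.
      def requestTotalExecutors(requestedTotal: Int): Unit = synchronized {
        targetNumExecutors = requestedTotal
      }

      // AM -> ResourceManager heartbeat: reconcile requests with the current target.
      def updateResourceRequests(): Unit = synchronized {
        val missing = targetNumExecutors - numRunningExecutors - pendingRequests.size
        if (missing > 0) {
          // ask the ResourceManager for `missing` more containers
        } else if (missing < 0) {
          // cancel up to `-missing` pending container requests
        }
        // After the timed-out executor is killed, targetNumExecutors and
        // numRunningExecutors drop together (2 -> 1), so `missing` stays 0 and no
        // replacement container is ever requested, even when the DN rejoins.
      }

      // Called when a container/executor is actually killed or lost.
      def executorKilled(): Unit = synchronized { numRunningExecutors -= 1 }
    }

Under this model, the executor count only recovers if something calls requestTotalExecutors with a higher number again, which nothing does when dynamic allocation is off.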