[
https://issues.apache.org/jira/browse/SPARK-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shixiong Zhu updated SPARK-16230:
---------------------------------
Assignee: Tejas Patil
Fix Version/s: 2.1.0
2.0.1
> Executors self-killing after being assigned tasks while still in init
> ---------------------------------------------------------------------
>
> Key: SPARK-16230
> URL: https://issues.apache.org/jira/browse/SPARK-16230
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Tejas Patil
> Assignee: Tejas Patil
> Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> I see this happening frequently in our prod clusters:
> * EXECUTOR: [CoarseGrainedExecutorBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L61] sends a request to register itself with the driver.
> * DRIVER: Registers the executor and [replies|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L179].
> * EXECUTOR: The ExecutorBackend receives the ACK and [starts creating an Executor|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L81].
> * DRIVER: Tries to launch a task since it knows there is a new executor. Sends a [LaunchTask|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L268] to this new executor.
> * EXECUTOR: The Executor is not yet init'ed (one reason I have seen is that it was still trying to register with the local external shuffle service). Meanwhile, it receives the `LaunchTask` and [kills itself|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L90] because the Executor is not init'ed. (See the sketch after this list.)
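> For reference, a condensed sketch of the relevant message loop in CoarseGrainedExecutorBackend (abbreviated from the linked source; logging and error handling are stripped, but the names follow the actual class):
> {code:scala}
> // Condensed sketch, not the full class: `executor` stays null until the
> // RegisteredExecutor handler below finishes constructing it, which can
> // take a while, e.g. while registering with the local external shuffle
> // service.
> var executor: Executor = null
>
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor =>
>     logInfo("Successfully registered with driver")
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
>
>   case LaunchTask(data) =>
>     if (executor == null) {
>       // The driver raced ahead of our init; this is the self-kill
>       // described in the last bullet above.
>       exitExecutor(1, "Received LaunchTask command but executor was null")
>     } else {
>       val taskDesc = ser.deserialize[TaskDescription](data.value)
>       executor.launchTask(this, taskId = taskDesc.taskId,
>         attemptNumber = taskDesc.attemptNumber, taskDesc.name,
>         taskDesc.serializedTask)
>     }
> }
> {code}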
> The driver assumes that the Executor is ready to accept tasks as soon as
> it has registered, but that's not true.
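> One illustrative direction (a sketch only, not necessarily what the committed fix does): instead of exiting, the backend could buffer a LaunchTask that arrives early and replay it once the Executor exists:
> {code:scala}
> // Hypothetical sketch, NOT the committed fix: buffer early LaunchTask
> // messages instead of self-killing, and drain them once init completes.
> import scala.collection.mutable
>
> private val pendingLaunches = mutable.Queue[SerializableBuffer]()
>
> override def receive: PartialFunction[Any, Unit] = {
>   case RegisteredExecutor =>
>     executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
>     // Replay anything the driver sent while we were still initializing.
>     while (pendingLaunches.nonEmpty) runTask(pendingLaunches.dequeue())
>
>   case LaunchTask(data) =>
>     if (executor == null) pendingLaunches.enqueue(data)  // was: exitExecutor(...)
>     else runTask(data)
> }
>
> private def runTask(data: SerializableBuffer): Unit = {
>   val taskDesc = ser.deserialize[TaskDescription](data.value)
>   executor.launchTask(this, taskId = taskDesc.taskId,
>     attemptNumber = taskDesc.attemptNumber, taskDesc.name,
>     taskDesc.serializedTask)
> }
> {code}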
> How this affects jobs / the cluster:
> * We waste time + resources on these executors, but they never do any
> meaningful computation.
> * The driver thinks the executor has started running the task, but since
> the executor has killed itself, it never tells the driver (BTW: this is
> another issue which I think could be fixed separately). The driver waits
> for 10 mins and then declares the executor dead, which adds to the
> latency of the job. Also, the failure count for those tasks gets bumped
> up even though the tasks never actually started. For unlucky tasks, this
> can cause job failure.
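> To make the last point concrete: per-task failure attempts are capped by `spark.task.maxFailures` (4 by default), so a handful of phantom attempts against self-killed executors can exhaust a task's budget. A workaround sketch only; it masks the symptom rather than fixing the race:
> {code:scala}
> import org.apache.spark.SparkConf
>
> // Workaround sketch: give tasks more headroom for attempts lost to
> // self-killed executors. This does not remove the registration race.
> val conf = new SparkConf()
>   .setAppName("example")
>   .set("spark.task.maxFailures", "8")  // default is 4
> {code}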