[
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-22148:
------------------------------------
Assignee: (was: Apache Spark)
> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current
> executors are blacklisted but dynamic allocation is enabled
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Juan Rodríguez Hortalá
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and
> the whole Spark job with `task X (partition Y) cannot run anywhere due to
> node and executor blacklist. Blacklisting behavior can be configured via
> spark.blacklist.*.` when all the available executors are blacklisted for a
> pending Task or TaskSet. This makes sense for static allocation, where the
> set of executors is fixed for the duration of the application, but this might
> lead to unnecessary job failures when dynamic allocation is enabled. For
> example, in a Spark application with a single job at a time, when a node
> fails at the end of a stage attempt, all other executors will complete their
> tasks, but the tasks running in the executors of the failing node will be
> pending. Spark will keep waiting for those tasks for 2 minutes by default
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it
> will blacklist those executors for that stage. At that point in time, other
> executors would have been released after being idle for 1 minute by default
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't
> started yet and so there are no more tasks available (assuming the default of
> spark.speculation = false). So Spark will fail because the only executors
> available are blacklisted for that stage.
> An alternative is requesting more executors from the cluster manager in this
> situation. This could be retried a configurable number of times after a
> configurable wait time between request attempts, so if the cluster manager
> fails to provide a suitable executor then the job is aborted like in the
> previous case.
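
The retry-then-abort idea proposed above can be sketched as follows. This is
only an illustration of the control flow, not Spark code and not the attached
patch: `RetryPolicy`, `wait_for_usable_executor`, and the callbacks
`live_executors` and `request_executor` are hypothetical names, and the
default values are placeholders rather than real Spark settings.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical knobs; neither name is an existing Spark setting."""
    max_attempts: int = 3       # how many times to ask for a new executor
    wait_seconds: float = 10.0  # pause between request attempts


def wait_for_usable_executor(
    live_executors: Callable[[], Iterable[str]],
    blacklisted: Set[str],
    request_executor: Callable[[], None],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a non-blacklisted executor is live.

    Returns False after `policy.max_attempts` unsuccessful requests, which is
    the point where the caller would abort the task set as it does today.
    """
    for attempt in range(policy.max_attempts + 1):
        # A usable executor exists if at least one live executor is not
        # blacklisted for the pending task.
        if any(e not in blacklisted for e in live_executors()):
            return True
        if attempt == policy.max_attempts:
            break  # requests exhausted, fall through to abort
        request_executor()          # stand-in for asking the cluster manager
        sleep(policy.wait_seconds)  # give the cluster manager time to respond
    return False
```

With static allocation the caller would skip this loop and abort immediately,
since no new executor can ever appear. Passing `sleep` as a parameter keeps
the sketch easy to exercise without real delays.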
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)