Github user kayousterhout commented on a diff in the pull request:
https://github.com/apache/spark/pull/17166#discussion_r107539016
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler](
       taskState: TaskState,
       reason: TaskFailedReason): Unit = synchronized {
     taskSetManager.handleFailedTask(tid, taskState, reason)
-    if (!taskSetManager.isZombie && taskState != TaskState.KILLED) {
+    if (!taskSetManager.isZombie) {
--- End diff ---
Without this change, the job could hang: if only one task were left and that task got killed, I don't think reviveOffers would ever be called.
@mridulm I'm not that concerned about the extra calls to reviveOffers. In the worst case, if every task in a job is speculated (which of course can't actually happen), this leads to 2x the number of calls to reviveOffers -- so it still doesn't change the asymptotic time complexity. There are already a bunch of cases where we're pretty conservative with reviveOffers, in the sense that we call it even though we might not need to (e.g., when an executor dies, even if there aren't any tasks that need to be run; or every time there are speculative tasks available to run, even if there aren't any resources to run them on), so this change is in keeping with that pattern.
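For context, here is a rough sketch of how the surrounding handleFailedTask method reads with this change applied. It is paraphrased from the diff hunk above rather than copied from upstream; the backend.reviveOffers() call and the exact comments are assumptions about the surrounding TaskSchedulerImpl code, not part of the diff itself.

  // Sketch only: paraphrased from the diff context above, not verbatim upstream source.
  def handleFailedTask(
      taskSetManager: TaskSetManager,
      tid: Long,
      taskState: TaskState,
      reason: TaskFailedReason): Unit = synchronized {
    // Let the TaskSetManager record the failure (or kill) and mark the task for
    // re-execution if it still needs to run.
    taskSetManager.handleFailedTask(tid, taskState, reason)
    if (!taskSetManager.isZombie) {
      // Before this change the revive was also skipped when taskState == TaskState.KILLED,
      // so if the last remaining task was killed, no new offers were revived and the job
      // could hang. The extra revives this causes are bounded: at most ~2x in the worst
      // case where every task in the job is speculated.
      backend.reviveOffers()  // backend: SchedulerBackend (assumed from the enclosing class)
    }
  }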