Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/16867#discussion_r104516899
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -740,6 +743,7 @@ private[spark] class TaskSetManager(
}
removeRunningTask(tid)
info.markFinished(state)
+ successfulTaskDurations.remove(taskInfos(tid).duration)
--- End diff ---
Consider a scenario like this, particularly when pre-emption happens:
* 10k tasks to be run, 9.5k finish on 500 executors - speculative execution is
triggered for the remaining tasks.
* Cluster load spikes, and 400 of the 500 executors are reclaimed.
* We will re-execute all the failed tasks - but those tasks are also now
candidates for speculative execution, since they will meet the constraints
(after this PR).
In the scenario above, with the existing code, the 500 speculative tasks which
are already running won't be stopped - but new ones won't be launched until the
criterion is met again.
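To make concrete which criterion I mean, here is a minimal, self-contained
sketch of the quantile gate (the names SPECULATION_QUANTILE and
minFinishedForSpeculation only approximate the TaskSetManager internals; this
is not code from the PR):

```scala
// Rough sketch of the gate that decides whether speculation runs for a task set.
// Names approximate TaskSetManager's fields; not taken from this PR.
object SpeculationGateSketch {
  val SPECULATION_QUANTILE = 0.75

  /** True once enough tasks in the set have succeeded for speculation to kick in. */
  def speculationEnabled(tasksSuccessful: Int, numTasks: Int): Boolean = {
    val minFinishedForSpeculation = (SPECULATION_QUANTILE * numTasks).floor.toInt
    tasksSuccessful >= minFinishedForSpeculation && tasksSuccessful > 0
  }

  def main(args: Array[String]): Unit = {
    // 9.5k of 10k tasks done: the quantile gate is open.
    println(speculationEnabled(tasksSuccessful = 9500, numTasks = 10000)) // true
    // The ~500 re-executed tasks are counted against this same open gate, so
    // they become speculation candidates again as soon as they run long enough.
  }
}
```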
You are right that the median duration computation itself is probably not
invalidated (it could be, but we can ignore that complexity here) - what gets
affected is determining when to (re-)enable speculative execution of tasks from
the set.
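To spell out the distinction, a throwaway sketch (not Spark code, purely
illustrative): a few hundred stale durations among ~9.5k successful ones barely
move the median, whereas the enable/disable decision above turns on exact
success counts.

```scala
// Toy check that the median is robust to a handful of stale entries.
object MedianRobustnessSketch {
  private def median(xs: Seq[Double]): Double = {
    val sorted = xs.sorted
    sorted(sorted.length / 2)
  }

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    val successful = Seq.fill(9500)(rng.nextDouble() * 100) // durations of finished tasks
    val stale = Seq.fill(500)(rng.nextDouble() * 100)       // durations later invalidated
    println(median(successful))          // median over truly successful tasks
    println(median(successful ++ stale)) // nearly the same with stale entries mixed in
  }
}
```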