Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/22288#discussion_r227067905
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -415,9 +420,55 @@ private[spark] class TaskSchedulerImpl(
           launchedAnyTask |= launchedTaskAtCurrentMaxLocality
         } while (launchedTaskAtCurrentMaxLocality)
       }
+
       if (!launchedAnyTask) {
-        taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
+        taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors) match {
+          case Some(taskIndex) => // Returns the taskIndex which was unschedulable
+
+            // If the taskSet is unschedulable we try to find an existing idle blacklisted
+            // executor. If we cannot find one, we abort immediately. Else we kill the idle
--- End diff ---
I'm a little worried that the idle condition will be too strict in some
scenarios: if there is a large backlog of tasks from another taskset, or if,
whatever the error is, the tasks take a while to fail (e.g., you've really got
a bad executor, but it's not apparent until after network timeouts or
something). That could happen if you're doing a big join and, while preparing
the input on the map side, one side just has one straggler left while the
other side still has a big backlog of tasks. Or in a jobserver-style
situation, where there are always other tasksets coming in.

That said, I don't have any better ideas at the moment, and I still think
this is an improvement.
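
To make the concern concrete, here is a minimal, self-contained sketch of
the "idle" condition as I read it. The names and the Map-based state are
illustrative assumptions, not the actual fields or helpers from this PR:

```scala
// Hypothetical sketch: executor id -> ids of tasks currently running on it.
object IdleBlacklistSketch {
  val executorIdToRunningTaskIds: Map[String, Set[Long]] = Map(
    "exec-1" -> Set(101L, 102L), // busy with another taskset's backlog
    "exec-2" -> Set.empty[Long]  // genuinely idle
  )

  // "Idle" means no running tasks at all. This is the strict part: exec-1
  // never qualifies while another taskset keeps it busy, so the
  // blacklisted-but-unschedulable taskset can only wait or abort.
  def findIdleBlacklistedExecutor(blacklistedExecs: Set[String]): Option[String] =
    blacklistedExecs.find { execId =>
      executorIdToRunningTaskIds.get(execId).forall(_.isEmpty)
    }

  def main(args: Array[String]): Unit = {
    println(findIdleBlacklistedExecutor(Set("exec-1", "exec-2"))) // Some(exec-2)
    println(findIdleBlacklistedExecutor(Set("exec-1")))           // None => abort
  }
}
```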
---