Github user mridulm commented on a diff in the pull request:
https://github.com/apache/spark/pull/17166#discussion_r107631487
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler](
           taskState: TaskState,
           reason: TaskFailedReason): Unit = synchronized {
         taskSetManager.handleFailedTask(tid, taskState, reason)
    -    if (!taskSetManager.isZombie && taskState != TaskState.KILLED) {
    +    if (!taskSetManager.isZombie) {
--- End diff --
@kayousterhout Raising a couple of points here:
a) Cost of enabling this.
For larger jobs, the cost is high. The jobs I typically ran had 50k
tasks on 750 executors (outliers being up to 200k tasks on ~4500 executors).
At 50k tasks, roughly 12.5k tasks would be speculated, which would result
in about 12.5k ReviveOffers messages, each processing 750 executors.
If I understood the examples mentioned, they would lead to a comparatively
much smaller number of revive offers (when an executor dies, and periodically
for speculative execution; I had that at 1s and not 100ms, iirc).
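To put rough numbers on point (a), here is a back-of-the-envelope sketch in Scala. It assumes each ReviveOffers pass considers every executor once, which is how I read the comment above. The object and function names are hypothetical, and the one-hour job duration is purely an illustrative assumption, not a figure from this thread.

```scala
object ReviveCostSketch {
  // Executor offers examined in total, assuming every revive walks all executors once.
  def offersExamined(revives: Long, executors: Long): Long = revives * executors

  // Number of purely periodic revives over a job of the given duration.
  def periodicRevives(durationMs: Long, intervalMs: Long): Long = durationMs / intervalMs

  def main(args: Array[String]): Unit = {
    val executors = 750L
    val speculatedKills = 12500L // figure quoted in the comment for a 50k-task job

    // One revive per killed speculative attempt.
    println(offersExamined(speculatedKills, executors)) // 9375000

    // Hypothetical comparison: a one-hour job revived only by the 1s periodic timer.
    val periodic = periodicRevives(durationMs = 3600000L, intervalMs = 1000L)
    println(offersExamined(periodic, executors))
  }
}
```

Under these assumptions, the kill-triggered revives alone examine about 9.4 million executor offers, and the job would have to run for roughly three and a half hours before the 1s periodic timer generated as many revives.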
b) Expectation and preconditions for `killTaskAttempt`.
Currently task kill is handled internally for specific purposes: when a
stage is killed, or when a successful task attempt has already completed (ignoring
test cases). Hopefully I did not miss others.
In these cases, we do not need task re-execution.
What precondition and expectation are we exposing when a user invokes
`killTaskAttempt`?
(Also, what use cases are we trying to enable through the API?)
I agree with your example about job hang - that will occur.
I assume that preconditions and expectations remain the same as what we
currently support.
That is, the task is killed and not reattempted, and the kill is performed
judiciously by the user code.
If this does not hold, then we might have other issues as well (for
example, the output commit coordinator will break).
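To make the change in contract concrete, here is a toy model of the two guard conditions from the diff. It is only a sketch: `State`, `Failed`, and `Killed` are hypothetical stand-ins rather than Spark's `TaskState`, and it assumes the guarded branch is what requests a new round of offers, as the ReviveOffers discussion in (a) suggests.

```scala
object KillReviveSketch {
  // Hypothetical stand-ins for the task states relevant here; not Spark's TaskState.
  sealed trait State
  case object Failed extends State
  case object Killed extends State

  // Guard on the removed line of the diff: killed tasks never trigger a revive.
  def revivesBefore(isZombie: Boolean, state: State): Boolean =
    !isZombie && state != Killed

  // Guard on the added line of the diff: any non-zombie task set triggers a revive.
  def revivesAfter(isZombie: Boolean, state: State): Boolean =
    !isZombie

  def main(args: Array[String]): Unit = {
    // Print the full truth table for both guards.
    for (z <- Seq(false, true); s <- Seq[State](Failed, Killed))
      println(s"isZombie=$z state=$s before=${revivesBefore(z, s)} after=${revivesAfter(z, s)}")
  }
}
```

The only row where the two guards differ is a killed task in a non-zombie task set. That is the case where the existing expectation ("killed and not reattempted") and the proposed behavior diverge.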
Having said this, I am not as familiar with the scheduler code anymore
(want to list these out before I go MIA).
c) @ericl we should first settle on the right contract/behavior to
expose, and then focus on an implementation that satisfies that design.
See above regarding cost. While functionality trumps cost/performance, we
should also be cognizant of it.