Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17166#discussion_r107631487
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -467,7 +474,7 @@ private[spark] class TaskSchedulerImpl private[scheduler](
           taskState: TaskState,
           reason: TaskFailedReason): Unit = synchronized {
         taskSetManager.handleFailedTask(tid, taskState, reason)
    -    if (!taskSetManager.isZombie && taskState != TaskState.KILLED) {
    +    if (!taskSetManager.isZombie) {
    --- End diff --
    
    @kayousterhout Raising a couple of points here:
    
    a) Cost of enabling this.
    For larger jobs, the cost is high. The jobs I typically ran had 50k tasks on 750 executors (with outliers of up to 200k tasks on ~4500 executors).
    At 50k tasks, roughly 12.5k tasks would be speculated, which would result in about 12.5k ReviveOffers, each processed against 750 executors.
    
    If I understood the examples mentioned correctly, they lead to a comparatively much smaller number of revive offers (one when an executor dies, and one periodically for speculative execution - I had that interval at 1s rather than 100ms, iirc).
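    
    As a rough worked version of the numbers in (a): the sketch below assumes each ReviveOffers pass considers every executor once and, for the outlier case, that the same speculation ratio (12.5k out of 50k) carries over. Neither assumption is measured. `ReviveOfferCost` and `executorOffers` are illustrative names only, not Spark APIs.

```scala
// Back-of-the-envelope estimate; task and executor counts come from the comment above.
object ReviveOfferCost {
  // Assumption: one ReviveOffers pass touches every executor exactly once.
  def executorOffers(reviveOffers: Long, executors: Long): Long =
    reviveOffers * executors

  def main(args: Array[String]): Unit = {
    // Typical job: ~12.5k speculated tasks, 750 executors.
    val typical = executorOffers(12500L, 750L)

    // Outlier job: 200k tasks on ~4500 executors.
    // Assumption: same speculation ratio as the typical job (12.5k / 50k).
    val outlierSpeculated = 200000L * 12500L / 50000L
    val outlier = executorOffers(outlierSpeculated, 4500L)

    println(s"typical job: $typical executor offers")
    println(s"outlier job: $outlier executor offers")
  }
}
```

    Under these assumptions that is roughly 9.4 million executor offers for the typical job and 225 million for the outlier, which is the cost I am concerned about.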
    
    
    b) Expectations and preconditions for `killTaskAttempt`.
    Currently, task kill is handled internally for specific purposes: when a stage is killed, or when an attempt of the task has already completed successfully (ignoring test cases) - hopefully I did not miss others.
    In these cases, we do not need task re-execution.
    
    What preconditions and expectations are we exposing when a user invokes `killTaskAttempt`?
    (Also, what use cases are we trying to enable through the API?)
    
    I agree with your example about job hang - that will occur. 
    
    I assume that the preconditions and expectations remain the same as what we currently support.
    That is, the task is killed and not reattempted, and the kill is performed judiciously by user code.
    If this does not hold, then we might have other issues as well (for example, the output commit coordinator will break).
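    
    To make the contract question concrete, the two versions of the condition in the diff can be modeled as below. This is a simplified sketch, not the scheduler code: `State`, `shouldReviveBefore` and `shouldReviveAfter` are hypothetical names, and it assumes the guarded statement is the one that triggers new revive offers.

```scala
object ReviveDecision {
  // Hypothetical stand-in for the task states relevant here; not Spark's TaskState.
  sealed trait State
  case object Failed extends State
  case object Killed extends State

  // Condition before the change: a killed task never triggers new offers,
  // which matches a "killed and not reattempted" contract.
  def shouldReviveBefore(isZombie: Boolean, state: State): Boolean =
    !isZombie && state != Killed

  // Condition after the change: any failed or killed task in a live
  // (non-zombie) task set triggers new offers.
  def shouldReviveAfter(isZombie: Boolean, state: State): Boolean =
    !isZombie

  def main(args: Array[String]): Unit = {
    for (zombie <- Seq(false, true); state <- Seq[State](Failed, Killed)) {
      val before = shouldReviveBefore(zombie, state)
      val after = shouldReviveAfter(zombie, state)
      println(s"isZombie=$zombie state=$state before=$before after=$after")
    }
  }
}
```

    The only case where the two differ is a killed task in a non-zombie task set, which is exactly the case whose expected behavior needs to be pinned down.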
    
    Having said this, I am no longer as familiar with the scheduler code (I want to list these points out before I go MIA).
    
    
    c) @ericl we should first decide what the right contract/behavior to expose is, and then focus on an implementation that satisfies that design.
    See above regarding cost. While functionality trumps cost/performance, we should also be cognizant of the latter.


