Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8760#discussion_r43786312
  
    --- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -83,8 +74,6 @@ private[spark] class TaskSetManager(
       val copiesRunning = new Array[Int](numTasks)
       val successful = new Array[Boolean](numTasks)
       private val numFailures = new Array[Int](numTasks)
    -  // key is taskId, value is a Map of executor id to when it failed
    --- End diff --
    
    ah, I'm glad you pointed this out, b/c this is actually a really important 
difference.  I think this change is broken in the way it works across 
multiple task sets.  In the original implementation, this distinction didn't 
matter so much because all of the bookkeeping was kept within one 
TaskSetManager.  However, now that BlacklistTracker is global, it is used 
across multiple stages, where the same task index refers to completely 
different things.  E.g., you might have a failure in task 1 of stage 1, and 
another failure of task 1 in stage 2, but those are totally different tasks.  
So the default strategy is doing something different than before.
    
    I think the general solution here is for all the `BlacklistXXX` classes to 
always refer to both a `stageId` and `taskId`, in the method arguments and also 
the bookkeeping.  You should add a test case for this as well.


