mridulm commented on PR #41723:
URL: https://github.com/apache/spark/pull/41723#issuecomment-1615415367
The issue here is in `TaskSetManager` (tsm), not `ExecutorAllocationManager`.
The scenario you laid out is correct, namely:
Initial state:
* For partition P, original task T1 = 1.0 and speculative task T2 = 1.1
* `speculatableTasks` is empty, since 1.1 has started already.
T1 fails [1]:
* tsm.`handleFailedTask` adds the task again (`addPendingTask` at the end).
* This in turn adds P back to all the pending-task lists.
Now assume the re-added task from [1] has not yet been scheduled when
`checkSpeculatableTasks` is invoked.
The condition in `checkAndSubmitSpeculatableTasks` will then be true, namely:
`!successful(index) && copiesRunning(index) == 1 &&
!speculatableTasks.contains(index)`.
Hence we add another speculative task for scheduling.
This is why we end up with 3 pending/running tasks for P: the running
speculative task, the re-added task from the failure, and the newly queued
speculative task.
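The evaluation above can be sketched with a minimal stand-alone model. The field names mirror `TaskSetManager`'s (`successful`, `copiesRunning`, `speculatableTasks`), but this is an illustrative sketch of the state for one partition P, not actual Spark code:

```scala
// Minimal model of the TaskSetManager state for one partition P after T1
// (the original copy, 1.0) fails while the speculative copy T2 (1.1) is
// still running. Illustrative only; not Spark internals.
object BuggyCheckSketch {
  val index = 0
  val successful = Array(false)           // P has not finished
  val copiesRunning = Array(1)            // only T2 (1.1) is still running
  val speculatableTasks = Set.empty[Int]  // 1.1 already started, so empty

  // The condition in checkAndSubmitSpeculatableTasks as it stands today:
  def shouldSpeculate: Boolean =
    !successful(index) && copiesRunning(index) == 1 &&
      !speculatableTasks.contains(index)

  def main(args: Array[String]): Unit =
    // Evaluates to true, so a second speculative copy gets queued.
    println(shouldSpeculate)
}
```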
So what `ExecutorAllocationManager` is doing is IMO correct; what is
incorrect is the addition of the second speculative task itself.
The fix would be to change the condition in
`checkAndSubmitSpeculatableTasks` to something like:
```
!successful(index) && copiesRunning(index) == 1 &&
!speculatableTasks.contains(index) &&
// make sure the currently running copy is not a speculative task
!info.speculative
```
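Under the same one-partition state as the walkthrough above, the extra `!info.speculative` clause would reject the duplicate, since the single surviving copy is itself the speculative task T2. Again a sketch with illustrative names (`infoSpeculative` stands in for the running copy's `TaskInfo.speculative` flag), not actual Spark code:

```scala
// Same one-partition model, now also consulting the running copy's
// speculative flag, as the proposed condition would do. Illustrative only.
object FixedCheckSketch {
  val index = 0
  val successful = Array(false)           // P has not finished
  val copiesRunning = Array(1)            // only T2 (1.1) is still running
  val speculatableTasks = Set.empty[Int]  // 1.1 already started, so empty
  val infoSpeculative = true              // the one running copy is T2 = 1.1

  def shouldSpeculate: Boolean =
    !successful(index) && copiesRunning(index) == 1 &&
      !speculatableTasks.contains(index) &&
      !infoSpeculative  // rejects: the running copy is already speculative

  def main(args: Array[String]): Unit =
    // Evaluates to false, so no duplicate speculative copy is queued.
    println(shouldSpeculate)
}
```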
+CC @Ngone51 as well, since this is a somewhat more involved change (though a
one-line change :-) )
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]