squito commented on issue #24497: [SPARK-27630][CORE]Stage retry causes 
totalRunningTasks calculation to be negative
URL: https://github.com/apache/spark/pull/24497#issuecomment-491429592
 
 
   I dug into the history of this a bit -- it's a bit scattered across multiple 
PRs for https://issues.apache.org/jira/browse/SPARK-11334, and work done under 
https://issues.apache.org/jira/browse/SPARK-11701 and elsewhere.
   
   It seems to me like we want to undo one key part of SPARK-11334 -- where 
we remove an entry from `stageIdToNumRunningTask` when we get a stageCompleted:
   
   
https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L698
   
   Part of that change was to defend against a missing TaskEnd event.  But I 
think we just have to make sure those events are delivered; I don't see a way 
for this to work otherwise*. You can't clear the count-per-stage(attempt) on 
stageEnd, because the stage can still have tasks running after a stageCompleted 
event.  I'm not sure if it matters whether you track running tasks per stage or 
per stage attempt; I think it might not.  One thing we'll have to figure out is 
when to clear entries from `stageIdToNumRunningTask` to avoid it growing 
indefinitely -- I guess you can remove an entry if its numRunningTasks is 0 AND 
it's been cleared from the other structures.
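
   A rough sketch of that bookkeeping, to make the idea concrete. This is not 
the actual ExecutorAllocationManager code: `RunningTaskTracker`, 
`completedStages`, `trackedStages` and the `on*` method names are made up for 
illustration, and it assumes tracking per stage id rather than per stage 
attempt. The count only changes on task start/end, and an entry is dropped 
only once the stage has completed AND its count is back to 0.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the bookkeeping discussed above; only the
// map name stageIdToNumRunningTask is taken from the discussion.
class RunningTaskTracker {
  private val stageIdToNumRunningTask = mutable.HashMap.empty[Int, Int]
  // Stage ids for which a stageCompleted event has been seen (assumed helper).
  private val completedStages = mutable.HashSet.empty[Int]

  def onStageSubmitted(stageId: Int): Unit = {
    // A retry reuses the stage id, so the stage is live again.
    completedStages -= stageId
  }

  def onTaskStart(stageId: Int): Unit = {
    stageIdToNumRunningTask(stageId) =
      stageIdToNumRunningTask.getOrElse(stageId, 0) + 1
  }

  def onTaskEnd(stageId: Int): Unit = {
    // Never go below zero, even if an unexpected extra TaskEnd shows up.
    stageIdToNumRunningTask.get(stageId).foreach { n =>
      stageIdToNumRunningTask(stageId) = math.max(n - 1, 0)
    }
    removeIfDone(stageId)
  }

  def onStageCompleted(stageId: Int): Unit = {
    // Do NOT drop the count here: tasks may still be running.
    completedStages += stageId
    removeIfDone(stageId)
  }

  def totalRunningTasks: Int = stageIdToNumRunningTask.values.sum

  def trackedStages: Set[Int] = stageIdToNumRunningTask.keySet.toSet

  // Remove the entry only when the stage has completed and no tasks remain.
  private def removeIfDone(stageId: Int): Unit = {
    val running = stageIdToNumRunningTask.getOrElse(stageId, 0)
    if (completedStages.contains(stageId) && running == 0) {
      stageIdToNumRunningTask -= stageId
      completedStages -= stageId
    }
  }
}
```

   With this shape, a stageCompleted that arrives while two tasks are still 
running leaves the count at 2, and the entry goes away only after both 
TaskEnd events have been seen.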
   
   @cxzl25 would you like to try to make this change?  This is a complex part 
which will require some careful review.
   
   * There is the problem that the EventBus drops events when it's full, but 
hopefully that's been improved significantly recently.  And anyway, it could 
drop a StageCompleted as easily as it could drop a TaskEnd.

