squito commented on issue #24497: [SPARK-27630][CORE]Stage retry causes totalRunningTasks calculation to be negative
URL: https://github.com/apache/spark/pull/24497#issuecomment-491429592

I dug into the history of this a bit -- it's scattered across multiple PRs for https://issues.apache.org/jira/browse/SPARK-11334, and work done under https://issues.apache.org/jira/browse/SPARK-11701 and elsewhere.

It seems to me like we want to undo one key part of SPARK-11334: where we remove an entry from `stageIdToNumRunningTasks` when we get a stageCompleted: https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L698

Part of that change was to defend against a missing TaskEnd event. But I think we just have to make sure those events are delivered; I don't see a way for this to work otherwise*. You can't clear the count per stage (or stage attempt) on stageEnd, because the stage can still have tasks running after a stageCompleted event. I'm not sure whether it matters if you track running tasks per stage or per stage attempt; I think it might not.

One thing we'll have to figure out is when to clear entries from `stageIdToNumRunningTasks` so that it doesn't grow indefinitely. I guess you can remove an entry if its numRunningTasks is 0 AND it's been cleared from the other structures.

@cxzl25 would you like to try to make this change? This is a complex part which will require some careful review.

* There is the problem that the EventBus drops events when it's full, but hopefully that's been improved significantly recently. And anyway, it could drop a StageCompleted as easily as it could drop a TaskEnd.
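For illustration only, here is a minimal sketch of the bookkeeping described above: an entry is dropped only once its running-task count is 0 and the stage attempt has been reported complete. It is a simplified, self-contained model and not the actual `ExecutorAllocationManager` code; `StageAttempt`, `RunningTaskTracker`, and all method names are hypothetical, and keying by stage attempt rather than by stage is an assumption.

```scala
import scala.collection.mutable

// Hypothetical key for a (stageId, stageAttemptId) pair; not a Spark class.
final case class StageAttempt(stageId: Int, attemptId: Int)

// Simplified model of the proposed bookkeeping; assumes every TaskEnd event
// is eventually delivered, as the comment above requires.
final class RunningTaskTracker {
  private val numRunningTasks = mutable.HashMap.empty[StageAttempt, Int]
  private val completedAttempts = mutable.HashSet.empty[StageAttempt]

  def onTaskStart(attempt: StageAttempt): Unit =
    numRunningTasks(attempt) = numRunningTasks.getOrElse(attempt, 0) + 1

  def onTaskEnd(attempt: StageAttempt): Unit = {
    // A TaskEnd for an untracked attempt is ignored, so the count never goes negative.
    numRunningTasks.get(attempt).foreach { n =>
      numRunningTasks(attempt) = math.max(n - 1, 0)
    }
    maybeRemove(attempt)
  }

  def onStageCompleted(attempt: StageAttempt): Unit = {
    // Do NOT clear the count here: tasks may still be running after this event.
    completedAttempts += attempt
    maybeRemove(attempt)
  }

  // Drop the entry only when the attempt is complete AND no tasks are running.
  private def maybeRemove(attempt: StageAttempt): Unit =
    if (completedAttempts.contains(attempt) &&
        numRunningTasks.getOrElse(attempt, 0) == 0) {
      numRunningTasks -= attempt
      completedAttempts -= attempt
    }

  def totalRunningTasks: Int = numRunningTasks.values.sum

  def trackedAttempts: Int = numRunningTasks.size
}
```

In this sketch, a stage-completed event that arrives while two tasks are still running leaves the count at 2; the entry disappears only after both task-end events arrive, so the map does not grow indefinitely as long as task-end events are delivered.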
