squito commented on issue #24497: [SPARK-27630][CORE]Stage retry causes totalRunningTasks calculation to be negative
URL: https://github.com/apache/spark/pull/24497#issuecomment-489706450

The problem is that the `StageCompleted` event for a stage retry comes before all those tasks finish ... but meanwhile, when that happens for a task failing 4 times, we intentionally want to set the stage total to 0 for SPARK-11334. Those two seem to conflict with each other.

Though to be honest, I don't understand why that fix was used in SPARK-11334 -- it seems like you should just wait for the task end events, and not adjust any `numRunning` counts just on `StageCompleted` events.

I'm amazed this hasn't been reported before; it must be causing dynamic allocation to perform poorly on any stage retry. I'd like to make sure we understand this properly and look a bit more at the history here.

@tgravescs @abellina I think you'll be interested in this one too.
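
To make the ordering issue concrete, here is a toy model of the counter, not Spark's actual listener code. The class `RunningTaskCounter`, its method names, and the `replay` helper are all hypothetical; the only behavior assumed from the discussion is that a stage-completed event resets the running count to 0 while late task-end events still decrement it.

```python
class RunningTaskCounter:
    """Toy model of a running-task counter (hypothetical, not Spark's code)."""

    def __init__(self):
        self.num_running = 0

    def on_task_start(self):
        self.num_running += 1

    def on_task_end(self):
        # Every task end decrements, even if the stage was already reset.
        self.num_running -= 1

    def on_stage_completed(self):
        # Assumed behavior under discussion: reset the count on stage completion.
        self.num_running = 0


def replay(events):
    """Feed event names to a fresh counter and record the count after each one."""
    counter = RunningTaskCounter()
    history = []
    for event in events:
        getattr(counter, "on_" + event)()
        history.append(counter.num_running)
    return history


# Stage completion arrives while two tasks are still running,
# then their task-end events arrive afterwards.
late_task_ends = ["task_start", "task_start", "stage_completed",
                  "task_end", "task_end"]
print(replay(late_task_ends))  # [1, 2, 0, -1, -2]
```

In this sketch the count goes negative only because both the reset and the late task-end events adjust the same counter. Relying on task-end events alone, as suggested above, would leave it at 0 once both tasks finish.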
