squito commented on issue #26614: [SPARK-29976] New conf for single task stage speculation URL: https://github.com/apache/spark/pull/26614#issuecomment-558271210

yeah I am also torn like @tgravescs . There are a ton of corner cases. I really don't like special-casing one task, but I'm not sure of a clean way to configure this. Even with 10 tasks, if you've got 4 cores per executor you could easily have 4 tasks stuck on your one bad executor, and with a default speculation quantile of 0.75 you wouldn't finish 8 tasks successfully, so speculation would never start. If you add in the fact that the poor performance may be across an entire node, and 64 cores per node is not uncommon, the limit goes way higher.

Speculative execution is always a heuristic; we know it's not going to be perfect. I feel like when you enable speculation, you are saying you're willing to accept some wasted resources, so it's more acceptable to run some speculative tasks when you don't really need to. But how much waste is OK? In Tom's example, say you had 10k tasks that each took an hour, but all are actually running fine -- the waste is pretty serious: you'll launch a speculative version of each task, so that's 10k cpu-hours wasted.

One alternative might be to only have this kick in when all tasks are running on the same host (the TSM already knows the hosts of the running tasks, it's in `TaskInfo`, so it would be easy to see if there is just one host used across all tasks).
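The single-host check suggested above could be sketched roughly like this. This is not Spark's actual `TaskSetManager` code, just an illustration of the idea; the `TaskInfo` stand-in here only models the `host` field, and `allOnSingleHost` is a hypothetical helper name:

```scala
// Minimal stand-in for the host information that TaskInfo carries.
// (Spark's real TaskInfo has many more fields.)
case class TaskInfo(taskId: Long, host: String)

object SpeculationCheck {
  // Hypothetical helper: true when there is at least one running task
  // and every running task is on the same host -- the condition under
  // which the proposed heuristic would allow speculation to kick in.
  def allOnSingleHost(runningTasks: Iterable[TaskInfo]): Boolean = {
    val hosts = runningTasks.map(_.host).toSet
    hosts.size == 1
  }
}
```

Under this sketch, a stage whose remaining tasks are spread across several hosts would never trigger the special case, which avoids the 10k-wasted-cpu-hours scenario above while still catching the one-bad-node case.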
