Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/18874

There is nothing in the code stopping you from idle-timeouting all of your executors; thus executors go to 0 and you deadlock. 0 executors = deadlock = definite bug. We definitely want to fix that. The user should not have to set a minimum number of executors to avoid a deadlock. The jira did not explicitly state you can go to 0; it states the general problem. Going to 0 is just a very bad case of the general problem.

I believe timing out down to a small number of executors is a bug as well. There are multiple reasons:

1) Things can happen in the system that are not expected and cause delays, and Spark should be resilient to these. If the driver is GC'ing, there are network delays, etc., we could idle timeout executors even though there are tasks to run on them; the scheduler simply hasn't had time to start those tasks. These timeouts just slow down the user's job, which the user does not want.

2) Internal Spark components have opposing requirements. The scheduler tries to get locality; dynamic allocation doesn't know about this, and by giving away executors it keeps the scheduler from doing what it was designed to do. Ideally we have enough executors to run all the tasks. If dynamic allocation allows those to idle timeout, the scheduler cannot make proper decisions. In the end this hurts users by affecting the job. A user should not have to mess with the configs to keep this basic behavior.

So I guess our disagreement is in the definition of the idle timeout. If the system were in perfect sync and the scheduler could tell dynamic allocation that it will never need a given executor for that stage, then let it time out. We don't have that, so the dynamic allocation manager is doing something that doesn't jibe with the logic in the scheduler, which specifically tries to get good locality. We should not be giving away executors while it is doing that.
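To make the deadlock concrete, here is a minimal toy model of an idle-timeout sweep (Python; the function `idle_timeout_pass` and its parameters are hypothetical illustrations, not Spark's actual ExecutorAllocationManager code). With no lower bound, one sweep during a long driver pause can remove every executor even though tasks are still pending:

```python
# Toy model of the idle-timeout problem described above.
# All names here are hypothetical; this is not Spark's real allocation code.

def idle_timeout_pass(executors, now, idle_timeout_s, min_executors=0):
    """Drop executors idle longer than idle_timeout_s, keeping at least min_executors."""
    survivors = {}
    for exec_id, last_task_finished in sorted(executors.items()):
        idle_for = now - last_task_finished
        # Keep the executor if it is not idle long enough, or if dropping it
        # would take us below the configured floor.
        if idle_for < idle_timeout_s or len(survivors) < min_executors:
            survivors[exec_id] = last_task_finished
    return survivors

# Three executors, all idle for 120s because the driver was GC'ing / the
# scheduler was waiting for locality, even though 10 tasks are still pending.
executors = {"exec-1": 0.0, "exec-2": 0.0, "exec-3": 0.0}
pending_tasks = 10

after = idle_timeout_pass(executors, now=120.0, idle_timeout_s=60.0)
print(len(after), pending_tasks)  # 0 executors left, 10 tasks pending: deadlock

# With a floor of 1 the job crawls, but it does not deadlock:
after_min = idle_timeout_pass(executors, now=120.0, idle_timeout_s=60.0,
                              min_executors=1)
print(len(after_min))  # 1
```

The floor here plays the role of the minimum-executors workaround the comment argues users should not have to configure by hand.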
From a user point of view this is very bad behavior and can cause the stage to take much longer. The user only wants executors to idle timeout when they could never be used, i.e. when the number of tasks is less than the number of executors times the cores per executor. There are multiple ways we could fix this, but the present one is the most straightforward. At the very least we need to make sure the count doesn't go to zero, and at some point look at whether we should reacquire executors and let the scheduler do its job again; after some point the tasks that got locality will likely have finished and the scheduler falls back more. But again, that is just more complex logic. I understand that you are saying that since the executors are idle it's best to give them back, but that directly affects the scheduler and, as I've seen in production, causes bad things to happen. Perhaps we need others to chime in. @vanzin @squito thoughts?
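The "could never be used" condition above can be written down directly. A minimal sketch (the helper `surplus_executors` is hypothetical, not a Spark API), assuming one task slot per core and uniform executors:

```python
import math

def surplus_executors(num_pending_tasks, num_executors, cores_per_executor):
    """Executors beyond what the pending tasks could ever occupy.

    Under the rule described above, only these are safe to idle-timeout:
    time out only when #tasks < #executors * cores-per-executor.
    """
    needed = math.ceil(num_pending_tasks / cores_per_executor)
    return max(0, num_executors - needed)

# 10 pending tasks, 8 executors with 4 cores each: the tasks can occupy at
# most ceil(10/4) = 3 executors, so 5 executors could never be used.
print(surplus_executors(10, 8, 4))  # 5

# 40 tasks on the same cluster fill every slot: nothing is safe to remove.
print(surplus_executors(40, 8, 4))  # 0
```

Anything below this surplus is an executor the scheduler may still want for locality, which is exactly the case the comment argues should not be timed out.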