GitHub user tgravescs commented on the issue:
There is nothing in the code stopping you from idle-timeouting all of
your executors; executors go to 0 and you deadlock. 0 executors = deadlock
= definite bug. We definitely want to fix that. Users should not have to set
a minimum number of executors just to prevent a deadlock.
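To make that workaround concrete (an illustrative sketch only, the values are made up; `spark.dynamicAllocation.minExecutors` is the existing knob I'm saying users shouldn't be forced to reach for):

```scala
import org.apache.spark.SparkConf

// Illustrative only: pinning a floor under dynamic allocation so the
// target can never idle-timeout down to 0. Users shouldn't need this
// just to avoid a deadlock.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1") // made-up value
```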
The jira did not explicitly state you can go to 0; it states the general
problem. Going to 0 is just a very bad case of that general problem.
I believe going down to a small number of executors is a bug as well,
for two reasons:
1) Things can happen in the system that are not expected and that cause
delays. Spark should be resilient to these. If the driver is GC'ing, there are
network delays, etc., we could idle-timeout executors even though there are
tasks to run on them; the scheduler just hasn't had time to start those
tasks. These timeouts only slow down the user's job, which the user does not want.
2) Internal Spark components have opposing requirements. The scheduler tries
to get locality; dynamic allocation doesn't know about this, and by giving
away executors it keeps the scheduler from doing what it was designed to do.
Ideally we have enough executors to run all the tasks. If dynamic
allocation allows those to idle timeout, the scheduler cannot make proper
decisions. In the end this hurts users by affecting the job. A user should not
have to mess with the configs to keep this basic behavior.
So I guess our disagreement is in the definition of the idle timeout.
If the system were in perfect sync and the scheduler could tell
dynamic allocation that it will never need a given executor for that stage, then
by all means let it time out. We don't have that, so the dynamic allocation
manager is doing something that doesn't jibe with the logic in the scheduler,
which specifically has logic to try to get good locality. We should not be giving
away executors while it is trying to apply that logic. From a user's point of
view this is very bad behavior and can cause the stage to take much longer.
The user only wants things to idle timeout when they could never be used, i.e.
when the number of tasks is less than the number of executors * cores per executor.
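A minimal sketch of that condition (names here are illustrative, not the actual allocation-manager code):

```scala
// Illustrative sketch, not Spark's real code: an executor could "never
// be used" only when there are already more task slots than tasks left.
def hasSurplusExecutors(numPendingAndRunningTasks: Int,
                        numExecutors: Int,
                        coresPerExecutor: Int): Boolean =
  numPendingAndRunningTasks < numExecutors * coresPerExecutor
```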
There are multiple ways we could fix this, but the present one is the most
straightforward. At the very least we need to make sure it doesn't go to zero,
and at some point look at whether we should reacquire executors and let the
scheduler do its job again, i.e. after some point the tasks that got locality
will likely have finished and the scheduler falls back more. But again, that
is more complex logic.
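Just to illustrate the minimal fix (hypothetical names, not a patch against the actual code):

```scala
// Hypothetical sketch of the "never go to zero" guard: clamp the
// allocation target to at least one executor while work remains.
def clampTarget(desiredExecutors: Int, hasPendingTasks: Boolean): Int =
  if (hasPendingTasks) math.max(desiredExecutors, 1)
  else desiredExecutors
```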
I understand your point that since they are idle it's best to give them
back, but that directly affects the scheduler, and as I've seen in production
it causes bad things to happen.
Perhaps we need others to chime in. @vanzin @squito thoughts?