Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/18874
  
    
    There is nothing in the code stopping you from idle timeouting all of 
your executors; the executor count hits 0 and you deadlock.  0 executors = 
deadlock = definite bug.  We definitely want to fix that.  A user should not 
have to set a minimum number of executors just to avoid a deadlock.
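    
    For reference, the only workaround today is a config floor, shown in the 
rough sketch below (real config names from the docs; this is not part of the 
fix, just the knob a user currently has to reach for):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // Config-only workaround: keep a floor under dynamic allocation so the
    // executor count can never idle-timeout all the way down to zero.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")
    ```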
    
    The jira did not explicitly state you can go to 0; it states the general 
problem.  Going to 0 is just a very bad case of the general problem. 
    
    I believe idle timeouting down to a small number of executors is a bug as 
well, for multiple reasons:
    
    1) Unexpected things can happen in the system that cause delays, and Spark 
should be resilient to them.  If the driver is GC'ing, there are network 
delays, etc., we could idle timeout executors even though there are tasks to 
run on them; the scheduler just hasn't had time to start those tasks yet.  
That only slows down the user's job, which the user does not want. 
    
    2) Internal Spark components have opposing requirements.  The scheduler 
tries to get locality; dynamic allocation doesn't know about this, and by 
giving away executors it keeps the scheduler from doing what it was designed 
to do (see the sketch just below).
    Ideally we have enough executors to run all the tasks on.  If dynamic 
allocation allows those to idle timeout, the scheduler cannot make proper 
decisions.  In the end this hurts users by affecting the job.  A user should 
not have to mess with the configs to get this basic behavior.
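    
    A rough sketch of the two independently tuned knobs in tension here (real 
config names; the values shown are just the defaults, for illustration only):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // The scheduler's locality wait and dynamic allocation's idle timeout are
    // set independently and know nothing about each other, which is exactly
    // the opposing-requirements problem described above.
    val conf = new SparkConf()
      .set("spark.locality.wait", "3s")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
    ```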
    
    So I guess our disagreement is over the definition of the idle timeout.  
If the system were in perfect sync and the scheduler could tell dynamic 
allocation that it will never need that executor for that stage, then letting 
it time out would be fine.  We don't have that, so the dynamic allocation 
manager is doing something that doesn't agree with the logic in the 
scheduler, which specifically tries to get good locality.  We should not be 
giving away executors while it is trying to apply that logic.  From a user's 
point of view this is very bad behavior and can cause the stage to take much 
longer.  The user only wants executors to idle timeout when they could never 
be used, i.e. when the number of tasks is less than the number of executors * 
cores per executor.  
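    
    To make that last condition concrete, here is a minimal sketch (the helper 
name and signature are made up for illustration; this is not the actual 
ExecutorAllocationManager logic):
    
    ```scala
    // Hypothetical check: an executor is only a candidate for idle timeout
    // when the stage could never use it, i.e. when the outstanding tasks no
    // longer fill all of the available task slots.
    def couldNeverBeUsed(outstandingTasks: Int,
                         numExecutors: Int,
                         coresPerExecutor: Int): Boolean =
      outstandingTasks < numExecutors * coresPerExecutor
    ```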
    
    There are multiple ways we could fix this, but the present one is the 
most straightforward.  At the very least we need to make sure it doesn't go 
to zero, and at some point look at whether we should reacquire executors and 
let the scheduler do its job again, i.e. after some point the tasks that got 
locality will likely have finished and the scheduler falls back more.  But 
again, that is just more complex logic.
    
    I understand your point that since they are idle it's best to give them 
back, but that directly affects the scheduler and, as I've seen in 
production, causes bad things to happen.
    
    Perhaps we need others to chime in.  @vanzin @squito  thoughts?


