GitHub user vanzin opened a pull request:

    https://github.com/apache/spark/pull/5018

    [SPARK-6325] [core,yarn] Disallow reducing executor count past running c...

    ...ount.
    
    The dynamic execution code has two ways to reduce the number of executors: 
one
    where it reduces the total number of executors it wants, by asking for an 
absolute
    number of executors that is lower than the previous one. The second is by
    explicitly killing idle executors.
    
    The problem arises becase these two paths don't really agree with each 
other. The
    backend allocator doesn't know when executors are idle, so it doesn't kill 
running
    executors when the absolute count goes below the current number of running 
executors.
    And when the driver explicitly kills executors, the backend will get 
confused and
    hit an assert.
    
    There might be a better long-term solution, but this is the easy one: 
disallow that
    situation in the first place. Since the backend won't kill executors unless 
explicitly
    asked to, never send a request to set the number of executors to lower than 
the
    currently running count (and explicitly check for that in the backend).
    
    As part of investigating this, I also found places where some minor cleanup 
could
    be done:
    - Avoid priting warning messages about no pending executors when canceling 
0 requests
    - Properly unregister executors from the `addressToExecutorId` map in the 
scheduler
      backend; this both avoids a misleading error message and fixes a slow 
memory leak.
    - Avoid sending executor requests to the allocation backend when not 
needed, to avoid
      a flood of log messages in the AM's logs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-6325

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5018
    
----
commit a158101b03fa609779157ac499d389b5965995ce
Author: Marcelo Vanzin <[email protected]>
Date:   2015-03-13T18:27:49Z

    [SPARK-6325] [core,yarn] Disallow reducing executor count past running 
count.
    
    The dynamic execution code has two ways to reduce the number of executors: 
one
    where it reduces the total number of executors it wants, by asking for an 
absolute
    number of executors that is lower than the previous one. The second is by
    explicitly killing idle executors.
    
    The problem arises becase these two paths don't really agree with each 
other. The
    backend allocator doesn't know when executors are idle, so it doesn't kill 
running
    executors when the absolute count goes below the current number of running 
executors.
    And when the driver explicitly kills executors, the backend will get 
confused and
    hit an assert.
    
    There might be a better long-term solution, but this is the easy one: 
disallow that
    situation in the first place. Since the backend won't kill executors unless 
explicitly
    asked to, never send a request to set the number of executors to lower than 
the
    currently running count (and explicitly check for that in the backend).
    
    As part of investigating this, I also found places where some minor cleanup 
could
    be done:
    - Avoid priting warning messages about no pending executors when canceling 
0 requests
    - Properly unregister executors from the `addressToExecutorId` map in the 
scheduler
      backend; this both avoids a misleading error message and fixes a slow 
memory leak.
    - Avoid sending executor requests to the allocation backend when not 
needed, to avoid
      a flood of log messages in the AM's logs.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to