GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/5018
[SPARK-6325] [core,yarn] Disallow reducing executor count past running count.
The dynamic allocation code has two ways to reduce the number of executors: one where it reduces the total number of executors it wants, by asking for an absolute number of executors that is lower than the previous one; the second is by explicitly killing idle executors.
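The two paths correspond to two of the entry points on Spark's `ExecutorAllocationClient` trait. The signatures below are paraphrased from memory of the 1.x codebase and trimmed to just these two calls, so treat this as a sketch rather than the full trait:

```scala
// Paraphrased sketch of Spark 1.x's ExecutorAllocationClient, trimmed to the
// two reduction paths discussed above (the real trait has more methods).
trait ExecutorAllocationClient {

  // Path 1: express demand as an absolute total; lowering this number is how
  // the allocation manager "shrinks" without naming specific executors.
  def requestTotalExecutors(numExecutors: Int): Boolean

  // Path 2: explicitly kill specific (idle) executors by ID.
  def killExecutors(executorIds: Seq[String]): Boolean
}
```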
The problem arises because these two paths don't really agree with each other. The backend allocator doesn't know when executors are idle, so it doesn't kill running executors when the absolute count drops below the current number of running executors. And when the driver then explicitly kills executors, the backend gets confused and hits an assert.
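To make the mismatch concrete, here is a toy walkthrough with invented bookkeeping; none of these names come from the actual allocator, and the final check just stands in for whatever internal assertion fires:

```scala
// Toy model of the disagreement (all names invented for illustration).
object AssertDemo extends App {
  val running  = 10 // executors actually running
  var pending  = 0  // outstanding container requests
  val target   = 8  // driver just lowered the absolute total

  // The allocator only cancels *pending* requests; it never kills running
  // executors. With nothing pending, its accounting goes negative:
  val toCancel = running + pending - target // 2, but nothing is pending
  pending -= toCancel                       // bookkeeping is now -2

  // An internal sanity check of this kind is what blows up:
  assert(pending >= 0, s"negative pending executor requests: $pending") // fires
}
```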
There might be a better long-term solution, but this is the easy one: disallow that situation in the first place. Since the backend won't kill executors unless explicitly asked to, never send a request to set the number of executors lower than the currently running count (and explicitly check for that in the backend).
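A minimal sketch of that guard, under invented names rather than the patch's own:

```scala
// Hypothetical sketch of the fix described above; names are invented.
object ExecutorTargetGuard {

  // Driver side: never ask for fewer executors than are currently running.
  def clampedTarget(requested: Int, numRunning: Int): Int =
    math.max(requested, numRunning)

  // Backend side: reject out-of-range requests loudly instead of letting
  // them corrupt the allocator's bookkeeping further down.
  def validate(requestedTotal: Int, numRunning: Int): Unit =
    require(requestedTotal >= numRunning,
      s"Requested total $requestedTotal is below running count $numRunning")
}
```

Checking on both sides is deliberate: the driver-side clamp prevents the bad request from being sent at all, while the backend-side check turns any remaining violation into an immediate, descriptive error instead of an assert deep inside the allocator.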
As part of investigating this, I also found places where some minor cleanup could be done:
- Avoid printing warning messages about no pending executors when canceling 0 requests.
- Properly unregister executors from the `addressToExecutorId` map in the scheduler backend; this both avoids a misleading error message and fixes a slow memory leak (a sketch follows this list).
- Avoid sending executor requests to the allocation backend when not needed, to avoid a flood of log messages in the AM's logs.
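For the second bullet, a self-contained sketch of the map hygiene involved. Only `addressToExecutorId` is a name from the PR; the surrounding class and the other map are hypothetical stand-ins for the scheduler backend's state:

```scala
import scala.collection.mutable

// Hypothetical container for the two maps; only addressToExecutorId mirrors
// the scheduler backend discussed above.
class BackendState {
  private val executorAddress     = mutable.HashMap.empty[String, String] // id -> address
  private val addressToExecutorId = mutable.HashMap.empty[String, String] // address -> id

  def registerExecutor(id: String, address: String): Unit = {
    executorAddress(id) = address
    addressToExecutorId(address) = id
  }

  def removeExecutor(id: String): Unit = {
    // The cleanup from the bullet above: drop the reverse-map entry too.
    // Otherwise stale addresses accumulate (the slow leak) and later lookups
    // against them produce misleading error messages.
    executorAddress.remove(id).foreach(addr => addressToExecutorId -= addr)
  }
}
```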
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-6325
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5018.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5018
----
commit a158101b03fa609779157ac499d389b5965995ce
Author: Marcelo Vanzin <[email protected]>
Date: 2015-03-13T18:27:49Z
[SPARK-6325] [core,yarn] Disallow reducing executor count past running count.
----