GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/12369
[SPARK-14560] Spillables can be forced to spill after inserting all data,
to avoid OOM
## What changes were proposed in this pull request?
This adds a new configuration, `spark.shuffle.spillAfterRead`, which can be
used to force `Spillable`s to spill their contents after all records have been
inserted. The default is false, to keep the previous behavior and avoid an
unnecessary performance penalty. However, it is needed in some cases to prevent
an OOM when one `Spillable` acquires all of the execution memory available to
a task, leaving no memory available for any other operations in the same
task.
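For reference, the flag could be enabled like any other Spark setting, for
example in `spark-defaults.conf` (a sketch; only the property name comes from
this change, the surrounding file content is illustrative):
```
# spark-defaults.conf: opt in to forcing Spillables to spill once all
# records have been inserted (defaults to false, preserving old behavior)
spark.shuffle.spillAfterRead   true
```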
This also required some small refactoring of `Spillable` to support a
forced spill from an external request (as opposed to not having enough memory
as records are added).
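To make the distinction concrete, here is a minimal sketch -- in plain Python
rather than Spark's Scala, with invented names (`ToySpillable`, `force_spill`)
that only illustrate the idea, not the actual API -- of the difference between
spilling under memory pressure during insertion and a spill forced externally
after all records are in:

```python
# Hedged sketch, not Spark's actual implementation: a toy in-memory buffer
# that can spill internally while records are inserted, or be force-spilled
# by an external caller once insertion is complete.

class ToySpillable:
    """A toy buffer that tracks acquired memory and can spill on request."""

    BYTES_PER_RECORD = 16  # rough per-record cost, for illustration only

    def __init__(self):
        self._buffer = []     # records currently held in memory
        self._spilled = []    # stands in for records written to disk
        self.memory_used = 0  # bytes this consumer has "acquired"

    def insert(self, record):
        self._buffer.append(record)
        self.memory_used += self.BYTES_PER_RECORD

    def _spill(self):
        # A real implementation would sort and write the records to disk.
        self._spilled.extend(self._buffer)
        self._buffer.clear()

    def force_spill(self):
        """External entry point: release all memory after insertion is done.

        Returns the number of bytes freed, so a caller (e.g. a task-level
        memory manager) can account for the released execution memory.
        """
        freed = self.memory_used
        if freed > 0:
            self._spill()
            self.memory_used = 0
        return freed
```

In the scenario described above, the shuffle-read side's structure could be
force-spilled before the shuffle-write side begins acquiring memory, instead
of holding its memory for the remainder of the task.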
I was initially hoping to limit the places where we needed to spill -- I
thought that it would only be in a `ShuffleMapTask` which also does a
shuffle-read. In that case there is clearly a `Spillable` on both the
shuffle-read and shuffle-write side. However, I realized this wasn't
sufficient -- there are other cases where you can have multiple `Spillable`s,
e.g. if you use a common partitioner across several aggregations, which then
get pipelined into one stage.
This also makes `Spillable`s register themselves as `MemoryConsumer`s with
the `TaskMemoryManager`. Note that this does *not* lead to cooperative memory
management for `Spillable`s -- the only reason for this is to improve the
logging around memory usage. Before this change, there would be messages like:
```
INFO memory.TaskMemoryManager: 217903352 bytes of memory were used by task
2109 but are not associated with specific consumers
```
Instead, with this change the logs report the memory as being associated
with the corresponding `Spillable`.
## How was this patch tested?
* unit tests were added to reproduce the original problem
* Jenkins unit tests
* also independently ran some large workloads that consistently hit an OOM
before this change and now pass
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark config_SPARK-14560
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12369.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12369
----
commit 6d50ef94170a90db23a49feeb1f95621826539a3
Author: Imran Rashid <[email protected]>
Date: 2016-04-13T17:50:33Z
SpillableMemoryConsumer, just for clearer reporting
commit 14da2f7204e5d7ea04a466033330e6ceb75cad4b
Author: Imran Rashid <[email protected]>
Date: 2016-04-13T17:57:35Z
failing test cases
commit b74f215ceeabcd5e401f18478a7fd3954cad48ea
Author: Imran Rashid <[email protected]>
Date: 2016-04-13T18:55:43Z
add conf to force spilling after inserting all records into Spillable
----