GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/12369
[SPARK-14560] Spillables can be forced to spill after inserting all data,
to avoid OOM
## What changes were proposed in this pull request?
This adds a new configuration, `spark.shuffle.spillAfterRead`, which can be
used to force `Spillable`s to spill their contents after all records have been
inserted. The default is false, to keep the previous behavior and avoid an
unnecessary performance penalty. However, it is needed in some cases to prevent
an OOM when one `Spillable` acquires all of the execution memory available to
a task, leaving no memory available for any other operations in the same
task.
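For reference, the flag could be enabled like any other Spark setting, for
example in `spark-defaults.conf` (a sketch; only the property name comes from
this change, the surrounding file content is illustrative):
```
# spark-defaults.conf: opt in to forcing Spillables to spill once all
# records have been inserted (defaults to false, preserving old behavior)
spark.shuffle.spillAfterRead   true
```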
This also required some small refactoring of `Spillable` to support a
forced spill from an external request (as opposed to not having enough memory
as records are added).
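To make the distinction concrete, here is a minimal sketch -- in plain Python
rather than Spark's Scala, with invented names (`ToySpillable`, `force_spill`)
that only illustrate the idea, not the actual API -- of the difference between
spilling under memory pressure during insertion and a spill forced externally
after all records are in:

```python
# Hedged sketch, not Spark's actual implementation: a toy in-memory buffer
# that can spill internally while records are inserted, or be force-spilled
# by an external caller once insertion is complete.

class ToySpillable:
    """A toy buffer that tracks acquired memory and can spill on request."""

    BYTES_PER_RECORD = 16  # rough per-record cost, for illustration only

    def __init__(self):
        self._buffer = []     # records currently held in memory
        self._spilled = []    # stands in for records written to disk
        self.memory_used = 0  # bytes this consumer has "acquired"

    def insert(self, record):
        self._buffer.append(record)
        self.memory_used += self.BYTES_PER_RECORD

    def _spill(self):
        # A real implementation would sort and write the records to disk.
        self._spilled.extend(self._buffer)
        self._buffer.clear()

    def force_spill(self):
        """External entry point: release all memory after insertion is done.

        Returns the number of bytes freed, so a caller (e.g. a task-level
        memory manager) can account for the released execution memory.
        """
        freed = self.memory_used
        if freed > 0:
            self._spill()
            self.memory_used = 0
        return freed
```

In the scenario described above, the shuffle-read side's structure could be
force-spilled before the shuffle-write side begins acquiring memory, instead
of holding its memory for the remainder of the task.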
I was initially hoping to limit the places where we needed to spill -- I
thought that it would only be in a `ShuffleMapTask` which also does a
shuffle-read. In that case there is clearly a `Spillable` on both the
shuffle-read and shuffle-write side. However, I realized this wasn't
sufficient -- there are other cases where you can have multiple `Spillable`s,
e.g. if you use a common partitioner across several aggregations, which then
get pipelined into one stage.
This also makes `Spillable`s register themselves as `MemoryConsumer`s with
the `TaskMemoryManager`. Note that this does *not* lead to cooperative memory
management for `Spillable`s -- the only reason for this is to improve the
logging around memory usage. Before this change, there would be messages like:
```
INFO memory.TaskMemoryManager: 217903352 bytes of memory were used by task
2109 but are not associated with specific consumers
```
Instead, with this change the logs report the memory as being associated
with the corresponding `Spillable`.
## How was this patch tested?
* unit tests were added to reproduce the original problem
* Jenkins unit tests
* also independently ran some large workloads that consistently hit an OOM
before this change and now pass
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark config_SPARK-14560
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12369.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12369
----
commit 6d50ef94170a90db23a49feeb1f95621826539a3
Author: Imran Rashid <[email protected]>
Date: 2016-04-13T17:50:33Z
SpillableMemoryConsumer, just for clearer reporting
commit 14da2f7204e5d7ea04a466033330e6ceb75cad4b
Author: Imran Rashid <[email protected]>
Date: 2016-04-13T17:57:35Z
failing test cases
commit b74f215ceeabcd5e401f18478a7fd3954cad48ea
Author: Imran Rashid <[email protected]>
Date: 2016-04-13T18:55:43Z
add conf to force spilling after inserting all records into Spillable
----