GitHub user mingyukim opened a pull request:

    https://github.com/apache/spark/pull/4420

    [SPARK-4808] Configurable spillable memory threshold + sampling rate

    In the general case, Spillable's heuristic of checking for memory stress
    on every 32nd item after 1000 items are read is good enough. In general,
    we do not want to be enacting the spilling checks until later on in the
    job; checking for disk-spilling too early can produce unacceptable
    performance impact in trivial cases.
    
    However, there are non-trivial cases, particularly if each serialized
    object is large, where checking for the necessity to spill too late
    would allow the memory to overflow. Consider if every item is 1.5 MB in
    size, and the heap size is 1000 MB. Then clearly if we only try to spill
    the in-memory contents to disk after 1000 items are read, we would have
    already accumulated 1500 MB of RAM and overflowed the heap.
    
    Patch #3656 attempted to circumvent this by checking the need to spill
    on every single item read, but that would cause unacceptable performance
    in the general case. However, the convoluted cases above should not be
    forced to be refactored to shrink the data items. Therefore it makes
    sense that the memory spilling thresholds be configurable.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark memory-spill-configurable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4420.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4420
    
----
commit 84afd105a0f27ef92c7034cbea97b74ea286232c
Author: mcheah <[email protected]>
Date:   2015-02-05T15:02:11Z

    [SPARK-4808] Configurable spillable memory threshold + sampling rate
    
    In the general case, Spillable's heuristic of checking for memory stress
    on every 32nd item after 1000 items are read is good enough. In general,
    we do not want to be enacting the spilling checks until later on in the
    job; checking for disk-spilling too early can produce unacceptable
    performance impact in trivial cases.
    
    However, there are non-trivial cases, particularly if each serialized
    object is large, where checking for the necessity to spill too late
    would allow the memory to overflow. Consider if every item is 1.5 MB in
    size, and the heap size is 1000 MB. Then clearly if we only try to spill
    the in-memory contents to disk after 1000 items are read, we would have
    already accumulated 1500 MB of RAM and overflowed the heap.
    
    Patch #3656 attempted to circumvent this by checking the need to spill
    on every single item read, but that would cause unacceptable performance
    in the general case. However, the convoluted cases above should not be
    forced to be refactored to shrink the data items. Therefore it makes
    sense that the memory spilling thresholds be configurable.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to