GitHub user mingyukim reopened a pull request:

    https://github.com/apache/spark/pull/4420

    [SPARK-4808] Removing minimum number of elements read before spill check

    In the general case, Spillable's heuristic of checking for memory stress
    on every 32nd item after 1000 items are read is good enough. In general,
    we do not want to be enacting the spilling checks until later on in the
    job; checking for disk-spilling too early can produce unacceptable
    performance impact in trivial cases.
    
    However, there are non-trivial cases, particularly if each serialized
    object is large, where checking for the necessity to spill too late
    would allow the memory to overflow. Consider if every item is 1.5 MB in
    size, and the heap size is 1000 MB. Then clearly if we only try to spill
    the in-memory contents to disk after 1000 items are read, we would have
    already accumulated 1500 MB of RAM and overflowed the heap.
    
    Patch #3656 attempted to circumvent this by checking the need to spill
    on every single item read, but that would cause unacceptable performance
    in the general case. However, the convoluted cases above should not be
    forced to be refactored to shrink the data items. Therefore it makes
    sense that the memory spilling thresholds be configurable.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark memory-spill-configurable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4420.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4420
    
----
commit 6e2509f060c1acb6706e575b66fb953dba9a424c
Author: mcheah <[email protected]>
Date:   2015-02-20T00:43:26Z

    [SPARK-4808] Removing minimum number of elements read before spill check
    
    We found that if we only start checking for spilling after reading 1000
    elements in a Spillable, we would run out of memory if there are fewer
    than 1000 elements but each of those elements are very large.
    
    There is no real need to only check for spilling after reading 1000
    things. It is still necessary, however, to mitigate the cost of entering
    a synchronized block. It turns out in practice however checking for
    spilling only on every 32 items is sufficient, without needing the
    minimum elements threshold.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to