GitHub user mingyukim reopened a pull request:
https://github.com/apache/spark/pull/4420
[SPARK-4808] Removing minimum number of elements read before spill check
In the general case, Spillable's heuristic of checking for memory stress
on every 32nd item after 1000 items are read is good enough. In general,
we do not want to be enacting the spilling checks until later on in the
job; checking for disk-spilling too early can produce unacceptable
performance impact in trivial cases.
However, there are non-trivial cases, particularly if each serialized
object is large, where checking for the necessity to spill too late
would allow the memory to overflow. Consider if every item is 1.5 MB in
size, and the heap size is 1000 MB. Then clearly if we only try to spill
the in-memory contents to disk after 1000 items are read, we would have
already accumulated 1500 MB of RAM and overflowed the heap.
Patch #3656 attempted to circumvent this by checking the need to spill
on every single item read, but that would cause unacceptable performance
in the general case. However, the convoluted cases above should not be
forced to be refactored to shrink the data items. Therefore it makes
sense that the memory spilling thresholds be configurable.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mccheah/spark memory-spill-configurable
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4420.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4420
----
commit 6e2509f060c1acb6706e575b66fb953dba9a424c
Author: mcheah <[email protected]>
Date: 2015-02-20T00:43:26Z
[SPARK-4808] Removing minimum number of elements read before spill check
We found that if we only start checking for spilling after reading 1000
elements in a Spillable, we would run out of memory if there are fewer
than 1000 elements but each of those elements are very large.
There is no real need to only check for spilling after reading 1000
things. It is still necessary, however, to mitigate the cost of entering
a synchronized block. It turns out in practice however checking for
spilling only on every 32 items is sufficient, without needing the
minimum elements threshold.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]