Patrick Wendell created SPARK-1777:
--------------------------------------
Summary: Pass "cached" blocks directly to disk if memory is not
large enough
Key: SPARK-1777
URL: https://issues.apache.org/jira/browse/SPARK-1777
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
Fix For: 1.1.0
Currently in Spark we entirely unroll a partition and then check whether it
will cause us to exceed the storage limit. This has an obvious problem: if the
partition itself is large enough to push us over the storage limit (and
eventually over the JVM heap), it will cause an OOM.
This can happen in cases where a single partition is very large or when someone
is running examples locally with a small heap.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/CacheManager.scala#L106
We should think a bit about the most elegant way to fix this - it shares some
similarities with the external aggregation code.
A simple idea is to periodically check the size of the buffer as we are
unrolling and see if we are over the memory limit. If we are, we could prepend
the existing buffer to the remaining iterator and write the entire thing out
to disk.
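The periodic-check idea could be sketched roughly as follows. All names here
(unrollSafely, sizeOf, checkInterval) are hypothetical, for illustration only,
and not Spark's actual API:

```scala
import scala.collection.mutable.ArrayBuffer

// Rough sketch of the periodic-check idea (hypothetical names, not Spark's API).
object UnrollSketch {
  // Unroll `values` into an in-memory buffer, checking the estimated size
  // every `checkInterval` elements. If the limit is exceeded, return Left
  // with an iterator that prepends the buffered prefix to the remaining
  // elements, so the caller can stream the whole partition to disk.
  // Otherwise return Right with the fully unrolled partition.
  def unrollSafely[T](
      values: Iterator[T],
      memoryLimit: Long,
      sizeOf: T => Long,
      checkInterval: Int = 16): Either[Iterator[T], Array[T]] = {
    val buffer = ArrayBuffer.empty[T]
    var estimatedSize = 0L
    var count = 0
    while (values.hasNext) {
      val v = values.next()
      buffer += v
      estimatedSize += sizeOf(v)
      count += 1
      if (count % checkInterval == 0 && estimatedSize > memoryLimit) {
        // Over the limit: hand the caller everything (buffered prefix plus
        // the rest of the iterator) so it can be written out to disk.
        return Left(buffer.iterator ++ values)
      }
    }
    Right(buffer.toArray) // Fit under the limit: fully unrolled in memory.
  }
}
```

In practice, estimating object sizes is itself nontrivial (Spark has a
SizeEstimator utility for this), and checking only every N elements bounds the
overshoot to roughly N elements' worth of memory.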
--
This message was sent by Atlassian JIRA
(v6.2#6252)