GitHub user clockfly opened a pull request:
https://github.com/apache/spark/pull/15056
[SPARK-17503][Core] Fix memory leak in Memory store when unable to cache
the whole RDD
## What changes were proposed in this pull request?
The memory store may throw an OutOfMemoryError when trying to cache a very
large RDD that cannot fit in memory.
```
scala> sc.parallelize(1 to 10000000, 5).map(_ => new Array[Long](1000)).cache().count
java.lang.OutOfMemoryError: Java heap space
    at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
    at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
    at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
Spark's MemoryStore uses a SizeTrackingVector as a temporary unroll buffer
that holds every input value read so far, before the values are transferred to
the cache. The problem is that when the input RDD is too big to cache, this
temporary SizeTrackingVector is not garbage collected in time. Because the
SizeTrackingVector can occupy all available storage memory, the executor JVM
may run out of memory quickly.
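To illustrate the kind of leak described above, here is a minimal sketch. It is not the code in this patch, and `ReleasingIterator`, `buffered`, `rest`, and `onRelease` are hypothetical names rather than Spark APIs. It assumes the buffered values are served first and the remaining input afterwards, and shows one way to drop the reference to the buffered part as soon as it has been consumed, so the backing buffer becomes eligible for garbage collection while the rest of the input is still being read:

```scala
// Hypothetical sketch; these names are illustrative and are not Spark's API.
final class ReleasingIterator[T](
    private var buffered: Iterator[T], // iterator over the temporary buffer
    rest: Iterator[T],                 // input values that were never buffered
    onRelease: () => Unit              // e.g. hand back the reserved unroll memory
) extends Iterator[T] {

  // Drop the only reference to the buffered part so the buffer can be collected.
  private def release(): Unit = {
    buffered = null
    onRelease()
  }

  override def hasNext: Boolean =
    if (buffered == null) rest.hasNext
    else if (buffered.hasNext) true
    else { release(); rest.hasNext }

  override def next(): T =
    if (!hasNext) throw new NoSuchElementException("next on empty iterator")
    else if (buffered != null) buffered.next()
    else rest.next()

  // True once the buffered part has been fully consumed and released.
  def isReleased: Boolean = buffered == null
}
```

With this shape, the buffer is held only while its values are still being served. A plain concatenation such as `buffered ++ rest` may keep the buffer reachable for as long as the combined iterator is alive.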
## How was this patch tested?
Unit test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/clockfly/spark memory_store_leak
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15056.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15056
----
commit a9a4a8b23afc64d7e2d7426b92013442308a8ea3
Author: Sean Zhong <[email protected]>
Date: 2016-09-12T07:12:48Z
SPARK-17503: Fix memory leak in Memory store when unable to cache the whole
RDD
----