GitHub user clockfly opened a pull request:

    https://github.com/apache/spark/pull/15056

    [SPARK-17503][Core] Fix memory leak in Memory store when unable to cache the whole RDD

    ## What changes were proposed in this pull request?
    
       MemoryStore may throw an OutOfMemoryError when trying to cache a very large RDD that cannot fit in memory.
       ```
       scala> sc.parallelize(1 to 10000000, 5).map(_ => new Array[Long](1000)).cache().count
    
       java.lang.OutOfMemoryError: Java heap space
        at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
        at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
        at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
        at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
       ```
    
    Spark's MemoryStore uses a SizeTrackingVector as a temporary unrolling buffer that holds every input value read so far before the values are transferred to the cache. When the input RDD is too big to cache, this temporary SizeTrackingVector is not garbage collected in time. Because the SizeTrackingVector can occupy all available storage memory, the executor JVM may quickly run out of memory.
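    
    To illustrate the general idea of not retaining an unroll buffer longer than needed, here is a minimal sketch. It is not the code in this patch: `ReleasingIterator` and its members are hypothetical names, and the sketch only assumes that the buffered values and the remaining input are both exposed as plain Scala iterators.
    
    ```scala
    // Hypothetical sketch; names are illustrative and not part of Spark's API.
    final class ReleasingIterator[T](
        private[this] var unrolled: Iterator[T], // values already buffered in memory
        rest: Iterator[T])                       // remaining input that was never buffered
      extends Iterator[T] {

      // True once the reference to the buffered values has been dropped.
      def released: Boolean = unrolled == null

      override def hasNext: Boolean = {
        if (unrolled == null) {
          rest.hasNext
        } else if (unrolled.hasNext) {
          true
        } else {
          // The buffered values are exhausted: drop the reference so the buffer
          // can be garbage collected while `rest` is still being consumed.
          unrolled = null
          rest.hasNext
        }
      }

      override def next(): T = {
        if (!hasNext) throw new NoSuchElementException("next on empty iterator")
        if (unrolled != null) unrolled.next() else rest.next()
      }
    }
    ```
    
    In this sketch, once the buffered values have been read, the iterator no longer references them, so a long-running consumer of `rest` does not keep the buffer alive.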
    
    ## How was this patch tested?
    
    Unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark memory_store_leak

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15056.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15056
    
----
commit a9a4a8b23afc64d7e2d7426b92013442308a8ea3
Author: Sean Zhong <[email protected]>
Date:   2016-09-12T07:12:48Z

    SPARK-17503: Fix memory leak in Memory store when unable to cache the whole RDD

----


