[
https://issues.apache.org/jira/browse/SPARK-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242604#comment-15242604
]
Liwei Lin commented on SPARK-14652:
-----------------------------------
hi [~weideng], this problem is probably caused by the absence of
{{DataFrame.unpersist()}} -- please call {{readingsDataFrame.unpersist()}} at
the end of {{foreachRDD}}.
If you do need {{cache()}}, I think a better way is to call {{cache()}}
on the {{DStream}} rather than on the {{DataFrame}}, i.e. {{readings.cache()}}
then {{readings.foreachRDD}}. The reason is that Spark Streaming will call
{{unpersist()}} automatically for you after each batch if you use
{{DStream.cache()}}, but won't do the same if you use {{DataFrame.cache()}}.
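A minimal sketch of the two approaches (assuming, as in the reporter's script, a {{DStream}} named {{readings}} and a {{SQLContext}} named {{sqlContext}} already in scope; this needs a running Spark Streaming job, so it is illustrative only, not a complete program):

```python
# Workaround: cache/unpersist the DataFrame inside foreachRDD.
def process(time, rdd):
    readingsDataFrame = sqlContext.createDataFrame(rdd)
    readingsDataFrame.cache()
    # ... run queries/writes against readingsDataFrame ...
    readingsDataFrame.unpersist()  # without this, the cached blocks accumulate

readings.foreachRDD(process)

# -- or, alternatively --

# Preferred: cache the DStream itself; Spark Streaming then unpersists
# each batch's RDD automatically once the batch completes.
readings.cache()
readings.foreachRDD(process)
```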
> pyspark streaming driver unable to cleanup metadata for cached RDDs leading
> to driver OOM
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-14652
> URL: https://issues.apache.org/jira/browse/SPARK-14652
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Streaming
> Affects Versions: 1.6.1
> Environment: pyspark 1.6.1
> python 2.7.6
> Ubuntu 14.04.2 LTS
> Oracle JDK 1.8.0_77
> Reporter: Wei Deng
>
> ContextCleaner was introduced in SPARK-1103 and according to its PR
> [here|https://github.com/apache/spark/pull/126]:
> {quote}
> RDD cleanup:
> {{ContextCleaner}} calls {{RDD.unpersist()}} to clean up persisted
> RDDs. Regarding metadata, the DAGScheduler automatically cleans up all
> metadata related to an RDD after all jobs have completed. Only the
> {{SparkContext.persistentRDDs}} keeps strong references to persisted RDDs.
> The {{TimeStampedHashMap}} used for that has been replaced by
> {{TimeStampedWeakValueHashMap}} that keeps only weak references to the RDDs,
> allowing them to be garbage collected.
> {quote}
> However, we have observed that this is not the case for a cached RDD in
> pyspark streaming code with the latest Spark 1.6.1 release. This is
> reflected in the ever-growing number of RDDs in the {{Storage}} tab of the
> Spark Streaming application's UI once a pyspark streaming job starts to run.
> We used the
> [writemetrics.py|https://github.com/weideng1/energyiot/blob/f74d3a8b5b01639e6ff53ac461b87bb8a7b1976f/analytics/writemetrics.py]
> code to reproduce the problem, and every time after running for 20+ hours,
> the driver's JVM will start to show signs of OOM, with old gen being filled
> up and JVM stuck in full GC cycles without any old gen JVM space being freed
> up, and eventually the driver will crash with OOM.
> We have collected heap dump right before the OOM happened, and can make it
> available for analysis if it's considered as useful. However, it might be
> easier to just monitor the growth of the number of RDDs in the {{Storage}}
> tab from the Spark application's UI to confirm this is happening. To
> illustrate the problem, we also tried to set {{--conf
> spark.cleaner.periodicGC.interval=10s}} in the spark-submit command line of
> pyspark code and enabled DEBUG level logging of the driver's logback.xml and
> confirmed that even if the cleaner gets triggered as quickly as every 10
> seconds, none of the cached RDDs will be unpersisted automatically by
> ContextCleaner.
> Currently we have resorted to manually calling unpersist() to work around the
> problem. However, this goes against the spirit of SPARK-1103, i.e. automated
> garbage collection in the SparkContext.
> We also conducted a simple test with Scala code and with setting {{--conf
> spark.cleaner.periodicGC.interval=10s}}, and found the Scala code was able to
> clean up the RDDs every 10 seconds as expected, so this appears to be a
> pyspark specific issue. We suspect it has something to do with Python not
> being able to pass those out-of-scope RDDs as weak references to the
> ContextCleaner.
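The weak-reference mechanism the PR relies on can be illustrated in plain Python (a hypothetical stand-in for {{TimeStampedWeakValueHashMap}}, not actual Spark code): an entry in a {{WeakValueDictionary}} is cleared only once every strong reference to the value is gone, which is exactly what would fail if the Python layer silently retained a strong reference to each RDD.

```python
import gc
import weakref


class CachedRDD:
    """Hypothetical stand-in for a persisted RDD."""
    def __init__(self, rdd_id):
        self.rdd_id = rdd_id


# Analogue of SparkContext.persistentRDDs backed by weak references.
persistent_rdds = weakref.WeakValueDictionary()

rdd = CachedRDD(1)
persistent_rdds[1] = rdd
hidden_ref = rdd  # e.g. a reference the Python layer forgot to drop

del rdd
gc.collect()
print(1 in persistent_rdds)  # True: the hidden strong reference pins the entry

del hidden_ref
gc.collect()
print(1 in persistent_rdds)  # False: the weak entry has been cleared
```

This mirrors the suspected failure mode: as long as any strong reference to the RDD survives on the Python side, the weak-valued map can never release it, and ContextCleaner never sees it as collectible.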
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)