[
https://issues.apache.org/jira/browse/SPARK-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242604#comment-15242604
]
Liwei Lin commented on SPARK-14652:
-----------------------------------
hi [~weideng], this problem is probably caused by the absence of
{{DataFrame.unpersist()}} -- please call {{readingsDataFrame.unpersist()}} at
the end of {{foreachRDD}}.
If you do need {{cache()}}, I think a better way is to call {{cache()}}
on the {{DStream}} rather than on the {{DataFrame}}, i.e. {{readings.cache()}}
then {{readings.foreachRDD}}. The reason is that Spark Streaming will call
{{unpersist()}} automatically for you after each batch if you use
{{DStream.cache()}}, but won't do the same if you use {{DataFrame.cache()}}.
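A minimal sketch of the two approaches (assuming, as in the reporter's script, a {{DStream}} named {{readings}} and a {{SQLContext}} named {{sqlContext}} already in scope; this needs a running Spark Streaming job, so it is illustrative only, not a complete program):

```python
# Workaround: cache/unpersist the DataFrame inside foreachRDD.
def process(time, rdd):
    readingsDataFrame = sqlContext.createDataFrame(rdd)
    readingsDataFrame.cache()
    # ... run queries/writes against readingsDataFrame ...
    readingsDataFrame.unpersist()  # without this, the cached blocks accumulate

readings.foreachRDD(process)

# -- or, alternatively --

# Preferred: cache the DStream itself; Spark Streaming then unpersists
# each batch's RDD automatically once the batch completes.
readings.cache()
readings.foreachRDD(process)
```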
> pyspark streaming driver unable to cleanup metadata for cached RDDs leading
> to driver OOM
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-14652
> URL: https://issues.apache.org/jira/browse/SPARK-14652
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Streaming
> Affects Versions: 1.6.1
> Environment: pyspark 1.6.1
> python 2.7.6
> Ubuntu 14.04.2 LTS
> Oracle JDK 1.8.0_77
> Reporter: Wei Deng
>
> ContextCleaner was introduced in SPARK-1103 and according to its PR
> [here|https://github.com/apache/spark/pull/126]:
> {quote}
> RDD cleanup:
> {{ContextCleaner}} calls {{RDD.unpersist()}} to clean up persisted
> RDDs. Regarding metadata, the DAGScheduler automatically cleans up all
> metadata related to an RDD after all jobs have completed. Only the
> {{SparkContext.persistentRDDs}} keeps strong references to persisted RDDs.
> The {{TimeStampedHashMap}} used for that has been replaced by
> {{TimeStampedWeakValueHashMap}} that keeps only weak references to the RDDs,
> allowing them to be garbage collected.
> {quote}
> However, we have observed that this is not the case for a cached RDD in
> pyspark streaming code with the latest Spark 1.6.1 release. This is
> reflected in the ever-growing number of RDDs in the {{Storage}} tab of the
> Spark Streaming application's UI once a pyspark streaming job starts to run.
> We used the
> [writemetrics.py|https://github.com/weideng1/energyiot/blob/f74d3a8b5b01639e6ff53ac461b87bb8a7b1976f/analytics/writemetrics.py]
> code to reproduce the problem, and every time after running for 20+ hours,
> the driver's JVM will start to show signs of OOM, with old gen being filled
> up and JVM stuck in full GC cycles without any old gen JVM space being freed
> up, and eventually the driver will crash with OOM.
> We have collected heap dump right before the OOM happened, and can make it
> available for analysis if it's considered as useful. However, it might be
> easier to just monitor the growth of the number of RDDs in the {{Storage}}
> tab from the Spark application's UI to confirm this is happening. To
> illustrate the problem, we also tried to set {{--conf
> spark.cleaner.periodicGC.interval=10s}} in the spark-submit command line of
> pyspark code and enabled DEBUG level logging of the driver's logback.xml and
> confirmed that even if the cleaner gets triggered as quickly as every 10
> seconds, none of the cached RDDs will be unpersisted automatically by
> ContextCleaner.
> Currently we have resorted to manually calling unpersist() to work around the
> problem. However, this goes against the spirit of SPARK-1103, i.e. automated
> garbage collection in the SparkContext.
> We also conducted a simple test with Scala code and with setting {{--conf
> spark.cleaner.periodicGC.interval=10s}}, and found the Scala code was able to
> clean up the RDDs every 10 seconds as expected, so this appears to be a
> pyspark specific issue. We suspect it has something to do with Python not
> being able to pass those out-of-scope RDDs as weak references to the
> ContextCleaner.
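The weak-reference mechanism the PR relies on can be illustrated in plain Python (a hypothetical stand-in for {{TimeStampedWeakValueHashMap}}, not actual Spark code): an entry in a {{WeakValueDictionary}} is cleared only once every strong reference to the value is gone, which is exactly what would fail if the Python layer silently retained a strong reference to each RDD.

```python
import gc
import weakref


class CachedRDD:
    """Hypothetical stand-in for a persisted RDD."""
    def __init__(self, rdd_id):
        self.rdd_id = rdd_id


# Analogue of SparkContext.persistentRDDs backed by weak references.
persistent_rdds = weakref.WeakValueDictionary()

rdd = CachedRDD(1)
persistent_rdds[1] = rdd
hidden_ref = rdd  # e.g. a reference the Python layer forgot to drop

del rdd
gc.collect()
print(1 in persistent_rdds)  # True: the hidden strong reference pins the entry

del hidden_ref
gc.collect()
print(1 in persistent_rdds)  # False: the weak entry has been cleared
```

This mirrors the suspected failure mode: as long as any strong reference to the RDD survives on the Python side, the weak-valued map can never release it, and ContextCleaner never sees it as collectible.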
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)