[ https://issues.apache.org/jira/browse/SPARK-35262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472248#comment-17472248 ]
Denis Krivenko commented on SPARK-35262:
----------------------------------------
[~iamelin] Could you please check/confirm the issue still exists in 3.2.0?
> Memory leak when dataset is being persisted
> -------------------------------------------
>
> Key: SPARK-35262
> URL: https://issues.apache.org/jira/browse/SPARK-35262
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Igor Amelin
> Priority: Major
>
> If a Java or Scala application with a SparkSession runs for a long time and
> persists a lot of datasets, it can crash because of a memory leak.
> I've noticed the following. When we have a dataset and persist it, the
> SparkSession used to load that dataset is cloned in CacheManager, and this
> clone is added as a listener to `listenersPlusTimers` in `ListenerBus`. But
> this clone isn't removed from the list of listeners afterwards, e.g. when
> unpersisting the dataset. If we persist a lot of datasets, the SparkSession
> is cloned and added to `ListenerBus` many times. This leads to a memory leak
> since the `listenersPlusTimers` list becomes very large.
> I've found out that the SparkSession is cloned in CacheManager when the
> parameters `spark.sql.sources.bucketing.autoBucketedScan.enabled` and
> `spark.sql.adaptive.enabled` are true. The first one is true by default, and
> this default behavior leads to the problem. When auto bucketed scan is
> disabled, the SparkSession isn't cloned, and there are no duplicates in
> ListenerBus, so the memory leak doesn't occur.
> Here is a small Java application to reproduce the memory leak:
> [https://github.com/iamelin/spark-memory-leak]
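
For anyone trying to picture the mechanism described in the report, here is a minimal plain-Java sketch. It is not Spark code and does not use any Spark API. It only models the reported pattern: each persist clones the session, the clone is registered on a shared listener list, and unpersist never deregisters it. The names `ListenerLeakSketch`, `Bus`, `Session`, `Cache` and the `cloneOnPersist` flag are hypothetical stand-ins; the flag is assumed to represent the condition from the report (the two settings being true).

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerLeakSketch {

    /** Stand-in for a listener bus: holds a strong reference to every listener. */
    static class Bus {
        final List<Object> listenersPlusTimers = new ArrayList<>();

        void addListener(Object listener) {
            listenersPlusTimers.add(listener);
        }
    }

    /** Stand-in for a session whose clones register themselves on the shared bus. */
    static class Session {
        final Bus bus;

        Session(Bus bus) {
            this.bus = bus;
        }

        Session cloneSession() {
            Session clone = new Session(bus);
            bus.addListener(clone); // registered here, never removed anywhere
            return clone;
        }
    }

    /** Stand-in for a cache manager that may clone the session on persist. */
    static class Cache {
        final boolean cloneOnPersist; // assumption: models the settings in the report

        Cache(boolean cloneOnPersist) {
            this.cloneOnPersist = cloneOnPersist;
        }

        void persist(Session session) {
            if (cloneOnPersist) {
                session.cloneSession();
            }
        }

        void unpersist(Session session) {
            // Intentionally empty: the modeled bug is that no listener is removed.
        }
    }

    /** Runs persist/unpersist cycles and returns how many listeners remain. */
    static int listenersAfter(int datasets, boolean cloneOnPersist) {
        Bus bus = new Bus();
        Session session = new Session(bus);
        Cache cache = new Cache(cloneOnPersist);
        for (int i = 0; i < datasets; i++) {
            cache.persist(session);
            cache.unpersist(session);
        }
        return bus.listenersPlusTimers.size();
    }

    public static void main(String[] args) {
        System.out.println("cloning on:  " + listenersAfter(1000, true));
        System.out.println("cloning off: " + listenersAfter(1000, false));
    }
}
```

In this model, 1000 persist/unpersist cycles leave 1000 listeners registered when cloning is on and none when it is off, which is consistent with the report's observation that disabling `spark.sql.sources.bucketing.autoBucketedScan.enabled` avoids the growth. Whether Spark 3.2.0 still behaves this way is exactly what needs confirming.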