[
https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph K. Bradley updated SPARK-18274:
--------------------------------------
Assignee: Sandeep Singh
> Memory leak in PySpark StringIndexer
> ------------------------------------
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
> Reporter: Jonas Amrich
> Assignee: Sandeep Singh
> Priority: Critical
>
> StringIndexerModel is not collected by the JVM garbage collector even after
> it is deleted in Python. The following code reproduces the problem; it fails
> after a few iterations (around 7 if you set driver memory to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), )
>      for _ in range(int(7e5))]  # 700000 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> An explicit call to the Python GC fixes the issue; the following code runs fine:
> {code}
> import gc
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()  # force Python to release the wrapper objects
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling the
> JVM gateway's detach in the model's destructor. This is implemented in
> pyspark.mllib.common.JavaModelWrapper but missing in
> pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be
> affected by this memory leak.
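
The destructor-based fix proposed above could look roughly like the sketch below. It runs without Spark: FakeGateway and JavaWrapperSketch are hypothetical stand-ins introduced only for illustration, and the only assumption about the real gateway is that it exposes a detach(java_object) method. It is not the actual pyspark.ml.wrapper.JavaWrapper code.

```python
class FakeGateway:
    """Hypothetical stand-in for the Py4J gateway; it only records detach calls."""

    def __init__(self):
        self.detached = []

    def detach(self, java_obj):
        # A real gateway would tell the JVM to drop its reference to java_obj.
        self.detached.append(java_obj)


class JavaWrapperSketch:
    """Illustrative wrapper (assumed shape, not the real pyspark.ml class)."""

    def __init__(self, gateway, java_obj):
        self._gateway = gateway
        self._java_obj = java_obj

    def __del__(self):
        # Release the JVM-side object when the Python wrapper is destroyed.
        if self._gateway is not None and self._java_obj is not None:
            self._gateway.detach(self._java_obj)


gateway = FakeGateway()
model = JavaWrapperSketch(gateway, "jvm-model-1")  # placeholder for a JVM handle
del model  # in CPython the reference count drops to zero and __del__ runs
print(gateway.detached)
```

With a destructor like this, the JVM reference is released as soon as the Python wrapper is destroyed. A wrapper that is part of a reference cycle would still only be destroyed when Python's cycle collector runs.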