[ https://issues.apache.org/jira/browse/SPARK-18274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15656431#comment-15656431 ]
Joseph K. Bradley commented on SPARK-18274:
-------------------------------------------
Adding one more TODO for this task:
The current fix for this is to put a {{__del__}} method in JavaWrapper which
releases the Java object. But that exposes another bug: copy should be
implemented within JavaParams, not JavaModel. Otherwise, JavaEvaluator (which
inherits from JavaParams) can be copied to produce multiple Python instances
(which should be treated independently) all of which link to the same Java
object. Changing one instance will then change others.
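
The aliasing problem described above can be sketched without Spark. This is only an illustration under stated assumptions: FakeJavaObject and Wrapper are hypothetical stand-ins rather than PySpark classes, and shallow_copy / independent_copy are made-up names for the two copy behaviours being contrasted.

```python
import copy


class FakeJavaObject:
    """Hypothetical stand-in for a JVM-side object; not a PySpark class."""

    def __init__(self):
        self.params = {}


class Wrapper:
    """Hypothetical stand-in for a Python wrapper that holds a Java reference."""

    def __init__(self, java_obj=None):
        self._java_obj = java_obj if java_obj is not None else FakeJavaObject()

    def set(self, name, value):
        # Parameters live on the (fake) Java side, as in a wrapped estimator.
        self._java_obj.params[name] = value
        return self

    def get(self, name):
        return self._java_obj.params.get(name)

    def shallow_copy(self):
        # Copies only the Python instance; both wrappers share one Java object.
        return copy.copy(self)

    def independent_copy(self):
        # Also creates a new Java-side object and copies its state across.
        new_java = FakeJavaObject()
        new_java.params = dict(self._java_obj.params)
        return Wrapper(new_java)


a = Wrapper().set("metricName", "rmse")

b = a.shallow_copy()
b.set("metricName", "mae")
shared_value = a.get("metricName")  # read back through `a` after changing `b`

c = a.independent_copy()
c.set("metricName", "r2")
independent_value = a.get("metricName")  # read back through `a` after changing `c`

print(shared_value, independent_value, c.get("metricName"))
```

In this sketch, writing through b is visible through a because both wrap the same object, while writing through c is not. It also hints at why a {{__del__}} that releases the Java object makes the sharing more dangerous: the first shared wrapper to be collected would release an object the other still uses.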
> Memory leak in PySpark StringIndexer
> ------------------------------------
>
> Key: SPARK-18274
> URL: https://issues.apache.org/jira/browse/SPARK-18274
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 1.5.2, 1.6.3, 2.0.1, 2.0.2, 2.1.0
> Reporter: Jonas Amrich
> Priority: Critical
>
> StringIndexerModel won't get collected by GC in Java even when deleted in
> Python. It can be reproduced by this code, which fails after a couple of
> iterations (around 7 if you set driver memory to 600MB):
> {code}
> import random, string
> from pyspark.ml.feature import StringIndexer
> l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), )
>      for _ in range(int(7e5))]  # 700000 random strings of 10 characters
> df = spark.createDataFrame(l, ['string'])
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
> {code}
> An explicit call to the Python GC fixes the issue; the following code runs fine:
> {code}
> import gc
> for i in range(50):
>     indexer = StringIndexer(inputCol='string', outputCol='index')
>     indexer.fit(df)
>     gc.collect()  # force collection of the Python-side wrappers
> {code}
> The issue is similar to SPARK-6194 and can probably be fixed by calling a JVM
> detach in the model's destructor. This is implemented in
> pyspark.mllib.common.JavaModelWrapper but missing in
> pyspark.ml.wrapper.JavaWrapper. Other models in the ml package may also be
> affected by this memory leak.
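
The destructor-based release proposed in the description can be sketched as follows. This is a self-contained illustration under stated assumptions, not the actual PySpark or Py4J code: FakeGateway, LeakyWrapper and ReleasingWrapper are hypothetical names, and the gateway is simulated by a set of live object ids.

```python
import gc


class FakeGateway:
    """Hypothetical stand-in for a JVM gateway; tracks live Java-side references."""

    def __init__(self):
        self.live = set()
        self._next_id = 0

    def new_object(self):
        self._next_id += 1
        self.live.add(self._next_id)
        return self._next_id

    def detach(self, obj_id):
        # Drop the reference so the (simulated) JVM could collect the object.
        self.live.discard(obj_id)


gateway = FakeGateway()


class LeakyWrapper:
    """Holds a Java-side reference but never releases it."""

    def __init__(self):
        self._java_obj = gateway.new_object()


class ReleasingWrapper(LeakyWrapper):
    """Releases the Java-side reference when the Python object is destroyed."""

    def __del__(self):
        java_obj = getattr(self, "_java_obj", None)  # guard against a failed __init__
        if java_obj is not None:
            gateway.detach(java_obj)


for _ in range(5):
    LeakyWrapper()  # wrapper is dropped, the Java-side reference stays
leaked = len(gateway.live)

gateway.live.clear()
for _ in range(5):
    ReleasingWrapper()  # __del__ detaches the reference
gc.collect()  # for interpreters without immediate reference counting
remaining = len(gateway.live)

print(leaked, remaining)
```

The sketch only shows the pattern: without a destructor the Java-side references outlive their Python wrappers, while with one they are detached once the wrapper is destroyed.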