Matei Zaharia created SPARK-1418:
------------------------------------
Summary: Python MLlib's _get_unmangled_rdd should uncache RDDs when training is done
Key: SPARK-1418
URL: https://issues.apache.org/jira/browse/SPARK-1418
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Reporter: Matei Zaharia
Right now, when PySpark converts a Python RDD of NumPy vectors to a Java one, it
caches the Java RDD, since many of the algorithms are iterative. We should call
unpersist() at the end of the algorithm, though, to free cache space. In addition,
it may be good to persist the Java RDD with StorageLevel.MEMORY_AND_DISK, so that
partitions that do not fit in memory are read back from disk instead of going
back through the NumPy conversion; that will almost certainly be faster.
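
A rough sketch of the persist-then-unpersist pattern suggested above, not the actual MLlib change. The helper name persisted is hypothetical, and the sketch only assumes that the RDD object exposes cache(), persist(level) and unpersist(), as PySpark RDDs do:

```python
from contextlib import contextmanager


@contextmanager
def persisted(rdd, storage_level=None):
    """Keep `rdd` persisted while the block runs, then release it.

    `persisted` is a hypothetical helper, not part of PySpark. `rdd` is
    assumed to expose cache(), persist(level) and unpersist().
    """
    if storage_level is None:
        rdd.cache()                 # default storage level
    else:
        rdd.persist(storage_level)  # e.g. StorageLevel.MEMORY_AND_DISK
    try:
        yield rdd
    finally:
        # Runs even if training raises, so the cache space is always freed.
        rdd.unpersist()


# Possible usage (train() stands in for any iterative algorithm):
#
#   from pyspark import StorageLevel
#   with persisted(converted_rdd, StorageLevel.MEMORY_AND_DISK) as data:
#       model = train(data)
```

Putting unpersist() in a finally clause means a failed training run does not leave the converted RDD in the cache.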