Matei Zaharia created SPARK-1418:
------------------------------------

             Summary: Python MLlib's _get_unmangled_rdd should uncache RDDs when training is done
                 Key: SPARK-1418
                 URL: https://issues.apache.org/jira/browse/SPARK-1418
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark
            Reporter: Matei Zaharia


Right now, when PySpark converts a Python RDD of NumPy vectors to a Java one, it 
caches the Java RDD, since many of the algorithms are iterative. We should call 
unpersist() at the end of the algorithm, though, to free cache space. In addition, 
it may be good to persist the Java RDD with StorageLevel.MEMORY_AND_DISK, so that 
partitions that do not fit in memory are read back from disk instead of going 
back through the NumPy conversion; that will almost certainly be faster.
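
As a rough illustration of the persist-then-unpersist pattern proposed above, 
here is a minimal sketch. It is not the actual MLlib code: the `persisted` 
helper and the `FakeRDD` stand-in are hypothetical names introduced only for 
this example, and the storage level is passed as a plain string so the snippet 
runs without a Spark installation. The only assumption about the real object is 
that it exposes persist(level) and unpersist(), as Spark RDDs do.

```python
from contextlib import contextmanager


@contextmanager
def persisted(rdd, storage_level):
    """Persist `rdd` for the duration of the block, then unpersist it.

    `rdd` is anything exposing persist(level) / unpersist() methods.
    `persisted` is a hypothetical helper, not part of PySpark.
    """
    rdd.persist(storage_level)
    try:
        yield rdd
    finally:
        # Runs even if training raises, so cache space is always released.
        rdd.unpersist()


class FakeRDD:
    """Stand-in used only to exercise the helper without a Spark cluster."""

    def __init__(self):
        self.level = None
        self.calls = []

    def persist(self, level):
        self.level = level
        self.calls.append("persist")
        return self

    def unpersist(self):
        self.level = None
        self.calls.append("unpersist")
        return self


rdd = FakeRDD()
with persisted(rdd, "MEMORY_AND_DISK") as cached:
    # An iterative training loop would make several passes over `cached` here.
    level_during_training = cached.level
```

With a real RDD, the second argument would be StorageLevel.MEMORY_AND_DISK 
rather than a string. The try/finally is the point of the sketch: the cached 
data is released whether training finishes normally or fails partway through.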



--
This message was sent by Atlassian JIRA
(v6.2#6252)
