[jira] [Created] (SPARK-1418) Python MLlib's _get_unmangled_rdd should uncache RDDs when training is done

Matei Zaharia (JIRA) Fri, 04 Apr 2014 15:34:27 -0700

Matei Zaharia created SPARK-1418:
------------------------------------

             Summary: Python MLlib's _get_unmangled_rdd should uncache RDDs when training is done
                 Key: SPARK-1418
                 URL: https://issues.apache.org/jira/browse/SPARK-1418
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark
            Reporter: Matei Zaharia



Right now, when PySpark converts a Python RDD of NumPy vectors to a Java one, 
it caches the Java RDD, since many of the algorithms are iterative. We should 
call unpersist() at the end of the algorithm, though, to free cache space. In 
addition, it may be good to persist the Java RDD with 
StorageLevel.MEMORY_AND_DISK, so that partitions that do not fit in memory are 
read back from disk instead of going through the NumPy conversion again; that 
will almost certainly be faster.
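
A minimal sketch of the persist/unpersist pattern described above, not the 
actual MLlib change. The helper name `cached` is hypothetical; it assumes only 
that the RDD object exposes persist(storage_level) and unpersist(), as PySpark 
RDDs do, and that the caller supplies the storage level (for example 
StorageLevel.MEMORY_AND_DISK).

```python
from contextlib import contextmanager


@contextmanager
def cached(rdd, storage_level):
    """Persist `rdd` for the duration of a with-block, then unpersist it.

    Hypothetical helper, not part of PySpark. `rdd` is any object that
    exposes persist(storage_level) and unpersist().
    """
    rdd.persist(storage_level)
    try:
        yield rdd
    finally:
        # Runs even if training raises, so the cache space is always freed.
        rdd.unpersist()


# Intended use (train() and java_rdd are placeholders, assumed for illustration):
#
#     with cached(java_rdd, StorageLevel.MEMORY_AND_DISK) as data:
#         model = train(data)
```

Wrapping the training call in try/finally means the cached data is released 
whether the algorithm finishes normally or fails partway through.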



--
This message was sent by Atlassian JIRA
(v6.2#6252)
