Repository: spark
Updated Branches:
  refs/heads/master e1f4de4a7 -> fdd466bed


[SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached data

`GeneralizedLinearModel` creates a cached RDD when building a model. This is 
inconvenient: when several models are built in a row, these RDDs fill up the 
cache, so useful data might get evicted.

The proposed solution is to always cache the dataset & remove the warning. 
There's a caveat though: the input dataset gets evaluated twice, the first time 
in line 270 when fitting `StandardScaler`, and the second time when running the 
optimizer. So it might be worth restoring the removed warning.

Another possible solution is to disable caching entirely & restore the removed 
warning. I don't really know which approach is better.
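
The double-evaluation caveat can be reproduced outside MLlib with a small 
sketch. Everything here is an assumption made for illustration: the object and 
method names are hypothetical, the two `count()` calls merely stand in for the 
scaler fit and the optimizer run, and `longAccumulator` requires Spark 2.0 or 
later. Accumulator updates made inside transformations are not guaranteed to be 
applied exactly once, so the counts are indicative only.

```scala
import org.apache.spark.SparkContext

object DoubleEvaluationSketch {
  // Hypothetical helper: reports how many times the map function ran
  // across two actions on the same RDD, with or without caching.
  def countEvaluations(sc: SparkContext, cached: Boolean): Long = {
    val evaluations = sc.longAccumulator("evaluations")
    val mapped = sc.parallelize(1 to 100, 4).map { x =>
      evaluations.add(1L) // one increment per element actually computed
      x.toDouble
    }
    val input = if (cached) mapped.cache() else mapped

    input.count() // first pass (stands in for fitting the scaler)
    input.count() // second pass (stands in for running the optimizer)

    // Release the cache so it does not linger after the measurement.
    if (cached) input.unpersist(blocking = false)
    evaluations.value
  }
}
```

Without caching, each action recomputes the mapped RDD from its lineage; with 
caching, the second action can usually read the stored partitions instead, at 
the cost of occupying cache space until the RDD is unpersisted.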

Author: Vyacheslav Baranov <slavik.bara...@gmail.com>

Closes #8395 from SlavikBaranov/SPARK-10182.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fdd466be
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fdd466be
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fdd466be

Branch: refs/heads/master
Commit: fdd466bed7a7151dd066d732ef98d225f4acda4a
Parents: e1f4de4
Author: Vyacheslav Baranov <slavik.bara...@gmail.com>
Authored: Thu Aug 27 18:56:18 2015 +0100
Committer: Sean Owen <so...@cloudera.com>
Committed: Thu Aug 27 18:56:18 2015 +0100

----------------------------------------------------------------------
 .../spark/mllib/regression/GeneralizedLinearAlgorithm.scala     | 5 +++++
 1 file changed, 5 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/fdd466be/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
index 7e3b4d5..8f657bf 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
@@ -359,6 +359,11 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
         + " parent RDDs are also uncached.")
     }
 
+    // Unpersist cached data
+    if (data.getStorageLevel != StorageLevel.NONE) {
+      data.unpersist(false)
+    }
+
     createModel(weights, intercept)
   }
 }
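
The guard added in the hunk above can also be applied from calling code. The 
following is a minimal sketch, not part of the patch: `UnpersistSketch` and 
`withTransient` are hypothetical names, and the `try`/`finally` wrapper is an 
addition that the patch itself does not use. Only `getStorageLevel`, 
`StorageLevel.NONE` and `unpersist` are taken from the diff.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Hypothetical helper: runs `body` on a derived RDD and afterwards
  // releases whatever cache that RDD holds.
  def withTransient[T, R](derived: RDD[T])(body: RDD[T] => R): R = {
    try {
      body(derived)
    } finally {
      // Same check as in the patch: unpersist only if a storage level is set.
      if (derived.getStorageLevel != StorageLevel.NONE) {
        // Non-blocking, matching `data.unpersist(false)` in the diff.
        derived.unpersist(blocking = false)
      }
    }
  }
}
```

A caller would wrap the intermediate RDD rather than the original input, for 
example `UnpersistSketch.withTransient(scaled.cache())(runOptimizer)`, where 
`scaled` and `runOptimizer` are placeholders. An input that the user cached 
should stay cached, since it may still be needed after the model is built.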


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
