Repository: spark
Updated Branches:
  refs/heads/master e1f4de4a7 -> fdd466bed
[SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached data

`GeneralizedLinearModel` creates a cached RDD when building a model. This is inconvenient: when several models are built in a row, these RDDs flood the memory, so useful data might get evicted from the cache.

The proposed solution is to always cache the dataset & remove the warning. There's a caveat though: the input dataset gets evaluated twice, the first time in line 270 when fitting `StandardScaler`, and the second time when running the optimizer. So it might be worth restoring the removed warning.

Another possible solution is to disable caching entirely & restore the removed warning. I don't really know which approach is better.

Author: Vyacheslav Baranov <slavik.bara...@gmail.com>

Closes #8395 from SlavikBaranov/SPARK-10182.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fdd466be
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fdd466be
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fdd466be

Branch: refs/heads/master
Commit: fdd466bed7a7151dd066d732ef98d225f4acda4a
Parents: e1f4de4
Author: Vyacheslav Baranov <slavik.bara...@gmail.com>
Authored: Thu Aug 27 18:56:18 2015 +0100
Committer: Sean Owen <so...@cloudera.com>
Committed: Thu Aug 27 18:56:18 2015 +0100

----------------------------------------------------------------------
 .../spark/mllib/regression/GeneralizedLinearAlgorithm.scala | 5 +++++
 1 file changed, 5 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/fdd466be/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
index 7e3b4d5..8f657bf 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
@@ -359,6 +359,11 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
           + " parent RDDs are also uncached.")
     }
 
+    // Unpersist cached data
+    if (data.getStorageLevel != StorageLevel.NONE) {
+      data.unpersist(false)
+    }
+
     createModel(weights, intercept)
   }
 }
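
For readers who want to try the cache-then-unpersist pattern outside MLlib, here is a minimal sketch. It is not the code from this commit: `UnpersistSketch` and `withCached` are hypothetical names introduced for illustration, the local master and sample data are assumptions, and the try/finally is a design choice of the sketch (the patch itself unpersists only on the normal return path). It relies only on the standard `RDD.cache`, `RDD.getStorageLevel` and `RDD.unpersist` calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Hypothetical helper (not part of MLlib): caches a derived RDD,
  // runs `body` on it, then releases the cached blocks.
  def withCached[T, R](derived: RDD[T])(body: RDD[T] => R): R = {
    val data = derived.cache()
    try {
      body(data)
    } finally {
      // Same guard as in the patch: only unpersist if something was persisted.
      if (data.getStorageLevel != StorageLevel.NONE) {
        data.unpersist(false) // non-blocking, as in the patch
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master is enough for a demonstration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("unpersist-sketch")
    val sc = new SparkContext(conf)
    try {
      val input = sc.parallelize(1 to 100)
      // Stand-in for the scaled training data derived from the input.
      val scaled = input.map(_ * 2.0)
      val total = withCached(scaled)(_.sum())
      println(s"sum = $total, storage level afterwards = ${scaled.getStorageLevel}")
    } finally {
      sc.stop()
    }
  }
}
```

Because the helper caches only the derived RDD that it created a handle for, the caller's own `input` RDD keeps whatever storage level it already had, which is the property that matters when several models are trained in a row.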