Repository: incubator-spark
Updated Branches:
  refs/heads/branch-0.9 b3fff962e -> 998abaecb


MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of 
features

There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's 
computed as the sum of matrices; an f x f matrix is created for each of n 
user/item rows in a partition. In `ALS.scala:214`:

```
        factors.flatMapValues{ case factorArray =>
          factorArray.map{ vector =>
            val x = new DoubleMatrix(vector)
            x.mmul(x.transpose())
          }
        }.reduceByKeyLocally((a, b) => a.addi(b))
         .values
         .reduce((a, b) => a.addi(b))
```

Completely correct, but there's a subtle but quite large memory problem here. 
map() is going to create all of these matrices in memory at once, when they 
don't need to ever all exist at the same time.
For example, if a partition has n = 100000 rows, and f = 200, then this 
intermediate product requires 32GB of heap. The computation will never work 
unless you can cough up workers with (more than) that much heap.

Fortunately there's a trivial change that fixes it; just add `.view` in there.

Author: Sean Owen <so...@cloudera.com>

Closes #629 from srowen/ALSMatrixAllocationOptimization and squashes the 
following commits:

062cda9 [Sean Owen] Update style per review comments
e9a5d63 [Sean Owen] Avoid unnecessary out of memory situation by not 
simultaneously allocating lots of matrices

(cherry picked from commit c8a4c9b1f6005815f5a4a331970624d1706b6b13)
Signed-off-by: Reynold Xin <r...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/998abaec
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/998abaec
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/998abaec

Branch: refs/heads/branch-0.9
Commit: 998abaecbc76f0a2b0317350d0ee589e78b0fbb0
Parents: b3fff96
Author: Sean Owen <so...@cloudera.com>
Authored: Fri Feb 21 12:46:12 2014 -0800
Committer: Reynold Xin <r...@apache.org>
Committed: Fri Feb 21 13:39:17 2014 -0800

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/mllib/recommendation/ALS.scala   | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spark/blob/998abaec/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
index 89ee070..3e93402 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
@@ -209,8 +209,8 @@ class ALS private (var numBlocks: Int, var rank: Int, var 
iterations: Int, var l
   def computeYtY(factors: RDD[(Int, Array[Array[Double]])]) = {
     if (implicitPrefs) {
       Option(
-        factors.flatMapValues{ case factorArray =>
-          factorArray.map{ vector =>
+        factors.flatMapValues { case factorArray =>
+          factorArray.view.map { vector =>
             val x = new DoubleMatrix(vector)
             x.mmul(x.transpose())
           }

Reply via email to