GitHub user srowen opened a pull request:

    https://github.com/apache/incubator-spark/pull/629

    MLLIB-25: Implicit ALS runs out of memory for moderately large numbers of 
features

    There's a step in implicit ALS where the matrix `Yt * Y` is computed. It's 
computed as the sum of matrices; an f x f matrix is created for each of n 
user/item rows in a partition. In `ALS.scala:214`:
    
    ```
            factors.flatMapValues{ case factorArray =>
              factorArray.map{ vector =>
                val x = new DoubleMatrix(vector)
                x.mmul(x.transpose())
              }
            }.reduceByKeyLocally((a, b) => a.addi(b))
             .values
             .reduce((a, b) => a.addi(b))
    ```
    
    Completely correct, but there's a subtle but quite large memory problem 
here. map() is going to create all of these matrices in memory at once, when 
they don't need to ever all exist at the same time.
    For example, if a partition has n = 100000 rows, and f = 200, then this 
intermediate product requires 32GB of heap. The computation will never work 
unless you can cough up workers with (more than) that much heap.
    
    Fortunately there's a trivial change that fixes it; just add `.view` in 
there.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/incubator-spark 
ALSMatrixAllocationOptimization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/629.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #629
    
----
commit e9a5d636b8a5d6288924ddf2871645c4eea41ffe
Author: Sean Owen <so...@cloudera.com>
Date:   2014-02-21T09:37:39Z

    Avoid unnecessary out of memory situation by not simultaneously allocating 
lots of matrices

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. To do so, please top-post your response.
If your project does not have this feature enabled and wishes so, or if the
feature is enabled but not working, please contact infrastructure at
infrastruct...@apache.org or file a JIRA ticket with INFRA.
---

Reply via email to