[
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000232#comment-14000232
]
Li Pu commented on SPARK-1782:
------------------------------
[~mengxr] thank you for the comments! You are right, (A^T A) x can be done in a
single pass. What we need is actually a eigs solver. Seems that mllib depends
on Breeze 0.7 (though I cannot find this release version in scalanlp/breeze). I
think the code would be cleaner if we call eigs in Breeze? or do you prefer to
have more control in mllib by calling ARPACK directly?
Also thank you for the pointer to PROPACK. I looked into the svd routine. It
calls lapack dbdsqr for actual svd computation. I will try to find a better way
to incorporate Fortran routines in a distributed way. For now we can use ARPACK.
> svd for sparse matrix using ARPACK
> ----------------------------------
>
> Key: SPARK-1782
> URL: https://issues.apache.org/jira/browse/SPARK-1782
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Li Pu
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Currently the svd implementation in mllib calls the dense matrix svd in
> breeze, which has a limitation of fitting n^2 Gram matrix entries in memory
> (n is the number of rows or number of columns of the matrix, whichever is
> smaller). In many use cases, the original matrix is sparse but the Gram
> matrix might not, and we often need only the largest k singular
> values/vectors. To make svd really scalable, the memory usage must be
> propositional to the non-zero entries in the matrix.
> One solution is to call the de facto standard eigen-decomposition package
> ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors
> of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the
> eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a
> reverse communication interface. The user provides a function to multiply a
> square matrix to be decomposed with a dense vector provided by ARPACK, and
> return the resulting dense vector to ARPACK. Inside ARPACK it uses an
> Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we
> need to provide are two matrix-vector multiplications, first M*x then M^t*x.
> These multiplications can be done in Spark in a distributed manner.
> The working memory used by ARPACK is O(n*k). When k (the number of desired
> singular values) is small, it can be easily fit into the memory of the master
> machine. The overall model is master machine runs ARPACK, and distribute
> matrix-vector multiplication onto working executors in each iteration.
> I made a PR to breeze with an ARPACK-backed svds interface
> (https://github.com/scalanlp/breeze/pull/240). The interface takes anything
> that can be multiplied by a DenseVector. On Spark/milib side, just need to
> implement the sparsematrix-vector multiplication.
> It might take some time to optimize and fully test this implementation, so
> set the workload estimate to 4 weeks.
--
This message was sent by Atlassian JIRA
(v6.2#6252)