Li Pu created SPARK-1782:
----------------------------

             Summary: svd for sparse matrix using ARPACK
                 Key: SPARK-1782
                 URL: https://issues.apache.org/jira/browse/SPARK-1782
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Li Pu


Currently the svd implementation in mllib calls the dense matrix svd in breeze, 
which has a limitation of fitting n^2 Gram matrix entries in memory (n is the 
number of rows or number of columns of the matrix, whichever is smaller). In 
many use cases, the original matrix is sparse but the Gram matrix might not, 
and we often need only the largest k singular values/vectors. To make svd 
really scalable, the memory usage must be propositional to the non-zero entries 
in the matrix. 

One solution is to call the de facto standard eigen-decomposition package 
ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of 
M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the 
eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a 
reverse communication interface. The user provides a function to multiply a 
square matrix to be decomposed with a dense vector provided by ARPACK, and 
return the resulting dense vector to ARPACK. Inside ARPACK it uses an 
Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we need 
to provide are two matrix-vector multiplications, first M*x then M^t*x. These 
multiplications can be done in Spark in a distributed manner.

The working memory used by ARPACK is O(n*k). When k (the number of desired 
singular values) is small, it can be easily fit into the memory of the master 
machine. The overall model is master machine runs ARPACK, and distribute 
matrix-vector multiplication onto working executors in each iteration. 

I made a PR to breeze with an ARPACK-backed svds interface 
(https://github.com/scalanlp/breeze/pull/240). The interface takes anything 
that can be multiplied by a DenseVector. On Spark/milib side, just need to 
implement the sparsematrix-vector multiplication. 

It might take some time to optimize and fully test this implementation, so set 
the workload estimate to 4 weeks. 




--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to