[CONF] Apache Mahout > Stochastic Singular Value Decomposition

confluence Mon, 07 May 2012 12:00:25 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Stochastic Singular Value Decomposition 
(https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition)



Edited by Dmitriy Lyubimov:
---------------------------------------------------------------------
Stochastic SVD method in Mahout produces reduced rank Singular Value 
Decomposition output in its strict mathematical definition: A=USV'.

h5. The benefits over other methods are: 
* reduced flops required compared to Krylov subspace methods
* In map-reduce world, a fixed number of MR iterations required regardless of 
rank requested
* Tweak precision/speed balance with options.
* A is a Distributed Row Matrix where rows may be identified by any Writable 
(such as a document path). As such, it would work directly on the output of 
seq2sparse. 
* As of 0.7 trunk, includes PCA and dimensionality reduction workflow 
(EXPERIMENTAL! Feedback on performance/other PCA related issues/ blogs is 
greatly appreciated.)

map-reduce characteristics: 
SSVD uses at most 3 MR _sequential_ steps (map-only + map-reduce + 2 optional 
parallel map-reduce jobs) to produce reduced rank approximation of U, V and S 
matrices. Additionally, two more map-reduce steps are added for each power 
iteration step if requested.

h5. Potential drawbacks: 
* potentially less precise (but adding even one power iteration seems to fix 
that quite a bit).

h5. Documentation

[Overview and Usage|^SSVD-CLI.pdf]
Note: Please use 0.6 or later! for PCA workflow, please use 0.7 or later.

h5. Publications

[Nathan Halko's 
dissertation|http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf]
 "Randomized methods for computing low-rank
approximations of matrices" contains comprehensive definition of 
parallelization strategy taken in Mahout SSVD implementation and also some 
precision/scalability benchmarks, esp. w.r.t. Mahout Lanczos implementation on 
a typical corpus data set.

h5. R simulation
[Non-parallel SSVD simulation in R with power iterations and PCA 
options|^ssvd.R]. Note that this implementation is not most optimal for 
sequential flow solver, but it is for demonstration purposes only. 

However, try this R code to simulate a meaningful input:
{code:title=tests.R}
n<-1000
m<-2000
k<-10

qi<-1

#simulated input
svalsim<-diag(k:1)

usim<- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k)))
vsim<- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k)))


x<- usim %*% svalsim %*% t(vsim) 

{code}

and try to compare ssvd.svd\(x\) and stock svd\(x\) performance for the same 
rank k, notice the difference in the running time. Also play with power 
iterations (qIter) and compare accuracies of standard svd and SSVD.



Change your notification preferences: 
https://cwiki.apache.org/confluence/users/viewnotifications.action

[CONF] Apache Mahout > Stochastic Singular Value Decomposition

Reply via email to