Hi Steve, yes, you are obviously right. The computation and eigen-decomposition of the covariance matrix can be replaced by SVD. Thanks for the hint, shame on me :) I will adapt the proposal in JIRA accordingly.
Thanks Max On Sat, Sep 25, 2010 at 7:23 PM, Steve Lianoglou <[email protected]> wrote: > Hi, > > I'm just a casual lurker and am not too well versed in the mahout > universe, so please forgive the intrusion, but doesn't mahout already > have an SVD implementation? > > If so, then you've already got the lion's share of a PCA > implementation right there, no? Here's a nice/accessible overview: > > Singular value decomposition and principal component analysis > http://public.lanl.gov/mewall/kluwer2002.html > > See the section entitled "Relation to principal component analysis" > for an explanation of how the right/left singular vectors are related > to the eigenvectors of your cov(X) matrix (which is pretty much the > central "nut" for PCA). > > -steve > > On Sat, Sep 25, 2010 at 9:50 AM, Max Heimel (JIRA) <[email protected]> wrote: >> Principal Component Analysis for mahout-examples >> ------------------------------------------------ >> >> Key: MAHOUT-512 >> URL: https://issues.apache.org/jira/browse/MAHOUT-512 >> Project: Mahout >> Issue Type: New Feature >> Reporter: Max Heimel >> Assignee: Max Heimel >> Priority: Minor >> >> >> h2.Overview: >> Principle Component Analysis is a widely used statistical method for >> decorrelation and dimension-reduction. It is typically applied for denoising >> data, data compression and as a preprocessing step for more advanced >> statistical analysis tools (like e.g. independent component analysis). The >> main idea of PCA is to apply a linear transformation of the given data >> points into a - typically less-dimensional - feature space, such that the >> direction of greatest variance within the data lies on the first dimension >> within the feature space. >> One approach to performing PCA is by transforming the d-dimensional data >> into the space spanned by the p largest eigenvectors of the covariance >> matrix of the (centered) data. This is done in the following steps (assume >> the data is given by a nxd data matrix containing n data rows with >> dimensionality d): >> * Compute the d-dimensional empirical mean-vector m of the data points >> * Center the data by subtracting m from each data point. >> * Compute the covariance matrix C = E' * E, where E is the centered data >> matrix and E' is its transpose. >> * Compute the eigenvalue decomposition of C, such that e_i is the >> eigenvector corresponding to eigenvalue lambda_i, order i such that lambda_i >> > lambda_j implies i<j. >> * Create the transformation matrix T as the p largest eigenvectors (e_0 ... >> e_p) of C. >> * Transforming a data matrix into the PCA space: R=(D-M)*T >> * Transforming data from PCA to original space: D=(R*T')+M >> >> h2.Implementation overview: >> I would suggest implementing PCA for the mahout-examples subproject. Maybe >> some of the jobs could also go into mahout-core/math, if wished. The >> implementation would consist of a set of M/R jobs and some gluing code. I >> assume data is given in HDFS as a file of d-dimensional data points, where >> each row corresponds to a data point. The map-reduce based implementation of >> PCA for Mahout would then consist of the following code: >> # Distributed Data centering code, that computes the emprical mean vector >> and centers the data >> # Distributed Code for computing the covariance matrix. >> # Sequential ( ? ) glue code for computing the eigenvalue decomposition of >> the covariance matrix and constructing the transformation matrices T and T'. >> This code would use the existing eigensolvers of Mahout. >> # Distributed code to transform data into the PCA space. This code would use >> the existing distributed matrix multiplication of mahout & the data >> centering job >> # Distributed code to transforms PCA data back into the original space. >> This code would use the existing distributed matrix multiplication of mahout >> & the data centering job. >> # Glue code that combines all jobs into a coherent implementation of PCA. >> >> h2.Implementation details: >> For a complete implementation the following three map/reduce jobs have to be >> written: >> # A M/R job to compute the empirical mean vector. Each row gets a row (data >> point) and emits (dimension, [value,1]) tuples. A local combiner >> preaggregates the values for each dimension to (dimension, [sum of values, >> nr of datapoints]). Finally, a (single) reducer computes the average for >> each dimension and stores back the mean vector. >> # A M/R job to center (uncenter) the data. The job consists of a single >> map-phase that reads in a row (data point) of the original matrix, subtracts >> (or adds) the empirical mean vector and emits (row number, new ro) pairs >> that are written back to disk. >> # A M/R job to compute the matrix product of a large matrix with its >> transpose. Principle idea: each mapper reads in a row of the matrix and >> computes the products of all combinations (e.g. [a,b,c] --> >> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position >> within the symmetric result matrix. A local combiner sums up the local >> products for each key, the reducer sums up the sum of products for each >> mapper per key and stores (position, value) tuples to disk. Finally a >> (local) clean-up phase constructs the covariance matrix from the (position, >> value) tuples. Since the covariance matrix is dxd and d is typically low >> compared to the number of data points, I assume the final local step should >> be fine. >> >> -- >> This message is automatically generated by JIRA. >> - >> You can reply to this email to add a comment to the issue online. >> >> > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact >
