Yes. I think M_b can be calculated efficiently as qsum' * a_mean, where qsum is the vector obtained by summing the rows of Q and a_mean is the mean row of A.
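
Just to spell out the bookkeeping (a sketch only, with the shapes assumed as follows: A is m x n with one data point per row, a_mean is its 1 x n mean row, Q is m x k, B = Q' A, and qsum = 1' Q is the 1 x k sum of Q's rows):

    A_mean = 1 * a_mean                    (every row of A_mean is a_mean)
    M_b    = Q' * A_mean
           = (Q' * 1) * a_mean
           = qsum' * a_mean

So M_b is a rank-1 outer product with entries M_b[i][j] = qsum[i] * a_mean[j]; it never has to be materialized or stored, and its entries can be formed on the fly from the k values of qsum and the n values of a_mean while accumulating (B-M_b)(B-M_b)'.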
Just to clarify: the data points are stored in the rows of A, so we would like to calculate the row average, correct? I think there is some confusion on this.

On Nov 27, 2011, at 7:40 PM, "Ted Dunning (Commented) (JIRA)" <[email protected]> wrote:

> 
> [ https://issues.apache.org/jira/browse/MAHOUT-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158164#comment-13158164 ]
> 
> Ted Dunning commented on MAHOUT-512:
> ------------------------------------
> 
> {quote}
> I think it is much easier to compute and subtract the B mean during computation of BB' in flight, and produce (B-M_b)(B-M_b)' instead.
> {quote}
> As long as M_b = Q' A_mean, then I think that this is good. Since this will have rank 1, you can compute it very efficiently on the fly and not store everything.
> 
> 
>> Principal Component Analysis for mahout-examples
>> ------------------------------------------------
>> 
>> Key: MAHOUT-512
>> URL: https://issues.apache.org/jira/browse/MAHOUT-512
>> Project: Mahout
>> Issue Type: New Feature
>> Reporter: Max Heimel
>> Assignee: Max Heimel
>> Priority: Minor
>> Attachments: MAHOUT-512.patch
>> 
>> Original Estimate: 2,016h
>> Remaining Estimate: 2,016h
>> 
>> h2.Overview:
>> Principal Component Analysis (PCA) is a widely used statistical method for decorrelation and dimension reduction. It is typically applied for denoising data, for data compression, and as a preprocessing step for more advanced statistical analysis tools (e.g. independent component analysis). The main idea of PCA is to apply a linear transformation that maps the given data points into a - typically lower-dimensional - feature space, such that the direction of greatest variance in the data lies along the first dimension of the feature space.
>> One approach to performing PCA is to transform the d-dimensional data into the space spanned by the p largest eigenvectors of the covariance matrix of the (centered) data. This is done in the following steps (assume the data is given by an n x d data matrix containing n data rows of dimensionality d):
>> * Compute the d-dimensional empirical mean vector m of the data points.
>> * Center the data by subtracting m from each data point.
>> * Compute the covariance matrix C = E' * E, where E is the centered data matrix and E' is its transpose.
>> * Compute the eigenvalue decomposition of C, such that e_i is the eigenvector corresponding to eigenvalue lambda_i, ordered such that lambda_i > lambda_j implies i < j.
>> * Create the transformation matrix T from the p largest eigenvectors (e_0 ... e_p) of C.
>> * Transforming a data matrix into the PCA space: R = (D - M) * T
>> * Transforming data from PCA space back to the original space: D = (R * T') + M
>> h2.Implementation overview:
>> I would suggest implementing PCA for the mahout-examples subproject. Maybe some of the jobs could also go into mahout-core/math, if desired. The implementation would consist of a set of M/R jobs and some gluing code. I assume the data is given in HDFS as a file of d-dimensional data points, where each row corresponds to a data point. The map-reduce based implementation of PCA for Mahout would then consist of the following code:
>> # Distributed data-centering code that computes the empirical mean vector and centers the data.
>> # Distributed code for computing the covariance matrix.
>> # Sequential ( ? ) glue code for computing the eigenvalue decomposition of the covariance matrix and constructing the transformation matrices T and T'.
>> This code would use the existing eigensolvers of Mahout.
>> # Distributed code to transform data into the PCA space. This code would use the existing distributed matrix multiplication of Mahout and the data centering job.
>> # Distributed code to transform PCA data back into the original space. This code would use the existing distributed matrix multiplication of Mahout and the data centering job.
>> # Glue code that combines all jobs into a coherent implementation of PCA.
>> h2.Implementation details:
>> For a complete implementation, the following three map/reduce jobs have to be written:
>> # An M/R job to compute the empirical mean vector. Each mapper gets a row (data point) and emits (dimension, [value, 1]) tuples. A local combiner pre-aggregates the values for each dimension to (dimension, [sum of values, number of data points]). Finally, a (single) reducer computes the average for each dimension and stores back the mean vector.
>> # An M/R job to center (uncenter) the data. The job consists of a single map phase that reads in a row (data point) of the original matrix, subtracts (or adds) the empirical mean vector, and emits (row number, new row) pairs that are written back to disk.
>> # An M/R job to compute the matrix product of a large matrix with its transpose. Basic idea: each mapper reads in a row of the matrix and computes the products of all combinations (e.g. [a,b,c] --> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position within the symmetric result matrix. A local combiner sums up the local products for each key; the reducer sums up these partial sums per key and stores (position, value) tuples to disk. Finally, a (local) clean-up phase constructs the covariance matrix from the (position, value) tuples. Since the covariance matrix is d x d and d is typically small compared to the number of data points, I assume this final local step should be fine.
> 
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
> 
> 
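
In case it helps to make the row-products job (the last M/R job described above) concrete, here is a rough sketch. None of this is Mahout code: the class names are made up, and the input format (one whitespace-separated, already centered row per line) and the key encoding (linear index into the upper triangle of the d x d result) are my own assumptions for illustration.

{code}
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Illustrative sketch of the "row products" job described above.
 * Input: one already-centered data row per line, values separated by
 * whitespace. For a row e the mapper emits e_i * e_j for all i <= j,
 * keyed by the linear position of (i, j) in the upper triangle of the
 * d x d result; the combiner/reducer simply sum per key.
 */
public class RowOuterProductSketch {

  /** Linear index of entry (i, j), i <= j, in the upper triangle of a d x d matrix. */
  static long upperTriangleIndex(int i, int j, int d) {
    return (long) i * d - (long) i * (i - 1) / 2 + (j - i);
  }

  public static class RowProductMapper
      extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

    private final LongWritable position = new LongWritable();
    private final DoubleWritable product = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().trim().split("\\s+");
      double[] row = new double[fields.length];
      for (int i = 0; i < fields.length; i++) {
        row[i] = Double.parseDouble(fields[i]);
      }
      int d = row.length;
      // Emit the product for every (i, j) with i <= j; the lower triangle
      // follows by symmetry of E' * E.
      for (int i = 0; i < d; i++) {
        for (int j = i; j < d; j++) {
          position.set(upperTriangleIndex(i, j, d));
          product.set(row[i] * row[j]);
          context.write(position, product);
        }
      }
    }
  }

  /** Usable both as combiner and as reducer, since it only sums values per key. */
  public static class SumReducer
      extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {

    private final DoubleWritable sum = new DoubleWritable();

    @Override
    protected void reduce(LongWritable position, Iterable<DoubleWritable> values,
        Context context) throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable value : values) {
        total += value.get();
      }
      sum.set(total);
      // (position, value) pairs for the upper triangle of E' * E; a small
      // local clean-up step can assemble the full d x d covariance matrix.
      context.write(position, sum);
    }
  }
}
{code}

For d = 3 the emitted keys line up with the (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc) numbering used in the example above.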
