[
https://issues.apache.org/jira/browse/MAHOUT-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157929#comment-13157929
]
Dmitriy Lyubimov edited comment on MAHOUT-512 at 11/27/11 4:58 PM:
-------------------------------------------------------------------
Sort of.
I think it is much easier to compute and subtract the mean of B during the
computation of BB' in flight, and produce (B-M_b)(B-M_b)' instead.
But I need to give it a little more thought. E.g. does it matter whether we
take the mean of the column vectors or of the row vectors here? It probably
does, but we can offer only one option here, and I haven't yet thought about
the best way to output it.
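
For reference, a sketch of the algebra (assuming the mean of the column
vectors, i.e. M_b = mu 1' where mu is the mean of B's n column vectors):

{noformat}
(B - mu 1')(B - mu 1')' = BB' - B 1 mu' - mu 1' B' + mu (1'1) mu'
                        = BB' - n mu mu',    where mu = (1/n) B 1
{noformat}

so BB' and mu can be accumulated in the same single pass over B and combined
at the end.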
> Principal Component Analysis for mahout-examples
> ------------------------------------------------
>
> Key: MAHOUT-512
> URL: https://issues.apache.org/jira/browse/MAHOUT-512
> Project: Mahout
> Issue Type: New Feature
> Reporter: Max Heimel
> Assignee: Max Heimel
> Priority: Minor
> Attachments: MAHOUT-512.patch
>
> Original Estimate: 2,016h
> Remaining Estimate: 2,016h
>
> h2.Overview:
> Principal Component Analysis is a widely used statistical method for
> decorrelation and dimensionality reduction. It is typically applied for
> denoising, data compression, and as a preprocessing step for more advanced
> statistical analysis tools (e.g. independent component analysis). The main
> idea of PCA is to apply a linear transformation that maps the given data
> points into a - typically lower-dimensional - feature space, such that the
> direction of greatest variance within the data lies along the first
> dimension of the feature space.
> One approach to performing PCA is to transform the d-dimensional data into
> the space spanned by the eigenvectors corresponding to the p largest
> eigenvalues of the covariance matrix of the (centered) data. This is done in
> the following steps (assume the data is given as an n x d data matrix
> containing n data rows of dimensionality d):
> * Compute the d-dimensional empirical mean vector m of the data points.
> * Center the data by subtracting m from each data point.
> * Compute the covariance matrix C = E' * E, where E is the centered data
> matrix and E' is its transpose.
> * Compute the eigenvalue decomposition of C, such that e_i is the
> eigenvector corresponding to eigenvalue lambda_i, with the eigenpairs
> ordered so that lambda_i > lambda_j implies i < j.
> * Create the transformation matrix T from the p largest eigenvectors (e_0
> ... e_{p-1}) of C.
> * Transform a data matrix into the PCA space: R = (D - M) * T
> * Transform data from the PCA space back to the original space: D = (R * T')
> + M (the reconstruction is exact only when p = d).
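> For illustration, a minimal local (non-distributed) sketch of these steps in
> plain Java; the tiny hard-coded matrix is example data only, and the deflated
> power iteration merely stands in for a proper eigensolver:
> {code:java}
> // Local PCA sketch: mean, centering, C = E' * E, top-p eigenvectors, projection.
> public class PcaSketch {
>   public static void main(String[] args) {
>     double[][] d = {{2.5, 2.4}, {0.5, 0.7}, {2.2, 2.9}, {1.9, 2.2}, {3.1, 3.0}};
>     int n = d.length, dim = d[0].length, p = 1;
>
>     // 1) empirical mean vector m
>     double[] m = new double[dim];
>     for (double[] row : d)
>       for (int j = 0; j < dim; j++) m[j] += row[j] / n;
>
>     // 2) center the data: E = D - M
>     double[][] e = new double[n][dim];
>     for (int i = 0; i < n; i++)
>       for (int j = 0; j < dim; j++) e[i][j] = d[i][j] - m[j];
>
>     // 3) C = E' * E (unnormalized; divide by n - 1 for the usual covariance)
>     double[][] c = new double[dim][dim];
>     for (double[] row : e)
>       for (int j = 0; j < dim; j++)
>         for (int k = 0; k < dim; k++) c[j][k] += row[j] * row[k];
>
>     // 4)+5) top-p eigenvectors of C by power iteration with deflation
>     double[][] t = new double[dim][p];
>     for (int v = 0; v < p; v++) {
>       double[] x = new double[dim];
>       java.util.Arrays.fill(x, 1.0 / Math.sqrt(dim));
>       double lambda = 0;
>       for (int iter = 0; iter < 1000; iter++) {
>         double[] y = new double[dim];
>         for (int j = 0; j < dim; j++)
>           for (int k = 0; k < dim; k++) y[j] += c[j][k] * x[k];
>         lambda = 0;
>         for (double yj : y) lambda += yj * yj;
>         lambda = Math.sqrt(lambda);           // ||Cx|| -> lambda_v at convergence
>         for (int j = 0; j < dim; j++) x[j] = y[j] / lambda;
>       }
>       for (int j = 0; j < dim; j++) t[j][v] = x[j];
>       // deflate: C <- C - lambda * x * x'
>       for (int j = 0; j < dim; j++)
>         for (int k = 0; k < dim; k++) c[j][k] -= lambda * x[j] * x[k];
>     }
>
>     // R = (D - M) * T; print the first principal component of each row
>     for (int i = 0; i < n; i++) {
>       double r = 0;
>       for (int j = 0; j < dim; j++) r += e[i][j] * t[j][0];
>       System.out.printf("row %d -> %.4f%n", i, r);
>     }
>   }
> }
> {code}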
> h2.Implementation overview:
> I would suggest implementing PCA for the mahout-examples subproject. Maybe
> some of the jobs could also go into mahout-core/math, if desired. The
> implementation would consist of a set of M/R jobs and some gluing code. I
> assume the data is given in HDFS as a file of d-dimensional data points,
> where each row corresponds to a data point. The map-reduce based
> implementation of PCA for Mahout would then consist of the following code:
> # Distributed data centering code that computes the empirical mean vector
> and centers the data.
> # Distributed code for computing the covariance matrix.
> # Sequential (?) glue code for computing the eigenvalue decomposition of
> the covariance matrix and constructing the transformation matrices T and T'.
> This code would use the existing eigensolvers of Mahout.
> # Distributed code to transform data into the PCA space. This code would use
> the existing distributed matrix multiplication of Mahout and the data
> centering job.
> # Distributed code to transform PCA data back into the original space. This
> code would use the existing distributed matrix multiplication of Mahout and
> the data centering job.
> # Glue code that combines all jobs into a coherent implementation of PCA
> (see the driver sketch below).
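> For illustration, a sketch of how the glue code (step 6) might chain the
> jobs, assuming Hadoop's newer mapreduce API; all class and job names here
> are hypothetical and configuration details are elided:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.mapreduce.Job;
>
> public class PcaDriver {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>
>     // 1) compute the empirical mean vector
>     Job meanJob = Job.getInstance(conf, "pca-mean");
>     // ... set mapper/combiner/reducer classes and input/output paths ...
>     if (!meanJob.waitForCompletion(true)) System.exit(1);
>
>     // 2) center the data (map-only job using the mean vector)
>     Job centerJob = Job.getInstance(conf, "pca-center");
>     // ...
>     if (!centerJob.waitForCompletion(true)) System.exit(1);
>
>     // 3) covariance matrix C = E' * E
>     Job covJob = Job.getInstance(conf, "pca-covariance");
>     // ...
>     if (!covJob.waitForCompletion(true)) System.exit(1);
>
>     // 4) sequential step: eigendecomposition of C and construction of T
>     //    (would call into Mahout's existing eigensolvers)
>
>     // 5) transform R = (D - M) * T via the distributed matrix multiplication
>   }
> }
> {code}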
> h2.Implementation details:
> For a complete implementation, the following three map/reduce jobs have to
> be written:
> # An M/R job to compute the empirical mean vector. Each mapper gets a row
> (data point) and emits (dimension, [value, 1]) tuples. A local combiner
> preaggregates the values for each dimension to (dimension, [sum of values,
> nr of data points]). Finally, a (single) reducer computes the average for
> each dimension and stores the mean vector.
> # An M/R job to center (or uncenter) the data. The job consists of a single
> map phase that reads in a row (data point) of the original matrix, subtracts
> (or adds) the empirical mean vector and emits (row number, new row) pairs
> that are written back to disk.
> # An M/R job to compute the matrix product of a large matrix with its
> transpose. Principal idea: each mapper reads in a row of the matrix and
> computes the products of all combinations of its components (e.g. [a,b,c]
> --> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the
> position within the symmetric result matrix. A local combiner sums up the
> local products for each key, the reducer sums up these partial sums per key
> and stores (position, value) tuples to disk. Finally, a (local) clean-up
> phase constructs the covariance matrix from the (position, value) tuples.
> Since the covariance matrix is d x d and d is typically small compared to
> the number of data points, I assume the final local step should be fine.
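> For illustration, a sketch of the mapper and the summing combiner/reducer
> for job 3; the flattened upper-triangle key encoding and the plain-text row
> format are assumptions of this sketch:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.io.DoubleWritable;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
>
> // Emits one partial product per upper-triangle cell of E' * E.
> // For a row [a,b,c]: (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc).
> class SelfProductMapper
>     extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
>   @Override
>   protected void map(LongWritable offset, Text row, Context ctx)
>       throws IOException, InterruptedException {
>     String[] parts = row.toString().trim().split("\\s+");
>     int d = parts.length;
>     double[] x = new double[d];
>     for (int j = 0; j < d; j++) x[j] = Double.parseDouble(parts[j]);
>     for (int j = 0; j < d; j++) {
>       int base = j * d - j * (j - 1) / 2;  // start of row j in the triangle
>       for (int k = j; k < d; k++)
>         ctx.write(new IntWritable(base + (k - j)),
>                   new DoubleWritable(x[j] * x[k]));
>     }
>   }
> }
>
> // Serves as both combiner and reducer: sums the partial products per cell.
> class SelfProductReducer
>     extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
>   @Override
>   protected void reduce(IntWritable pos, Iterable<DoubleWritable> vals,
>       Context ctx) throws IOException, InterruptedException {
>     double sum = 0;
>     for (DoubleWritable v : vals) sum += v.get();
>     ctx.write(pos, new DoubleWritable(sum));
>   }
> }
> {code}
> The local clean-up pass would then map each flattened position back to its
> (row, column) pair and mirror it across the diagonal to materialize the
> d x d matrix.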