[ https://issues.apache.org/jira/browse/MAHOUT-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157929#comment-13157929 ]

Dmitriy Lyubimov edited comment on MAHOUT-512 at 11/27/11 4:58 PM:
-------------------------------------------------------------------

Sort of. 

I think it is much easier to compute and subtract the B mean during the 
computation of BB' in flight, and produce (B-M_b)(B-M_b)' instead.

But I need to give it a little more thought. E.g., does it matter whether we 
consider the mean of column vectors or of row vectors here? It probably does, 
but we can offer only one option here, and I haven't thought yet about the 
best way to output it.
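
For reference, a sketch of the identity that makes the in-flight computation 
work (assuming M_b denotes the matrix whose rows all equal the mean row of B):

  (B - M_b)(B - M_b)' = BB' - B*M_b' - M_b*B' + M_b*M_b'

Since all rows of M_b are the same vector m, the correction terms reduce to 
B*m and m'*m, which can be accumulated in the same pass as the BB' partial 
products.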
                
> Principal Component Analysis for mahout-examples
> ------------------------------------------------
>
>                 Key: MAHOUT-512
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-512
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Max Heimel
>            Assignee: Max Heimel
>            Priority: Minor
>         Attachments: MAHOUT-512.patch
>
>   Original Estimate: 2,016h
>  Remaining Estimate: 2,016h
>
> h2.Overview:
> Principal Component Analysis is a widely used statistical method for 
> decorrelation and dimensionality reduction. It is typically applied for 
> denoising data, for data compression, and as a preprocessing step for more 
> advanced statistical analysis tools (e.g., independent component analysis). 
> The main idea of PCA is to apply a linear transformation that maps the given 
> data points into a - typically lower-dimensional - feature space, such that 
> the direction of greatest variance in the data lies along the first 
> dimension of the feature space.
> One approach to performing PCA is to transform the d-dimensional data into 
> the space spanned by the p largest eigenvectors of the covariance matrix of 
> the (centered) data. This is done in the following steps (assume the data is 
> given by an n x d data matrix containing n data rows of dimensionality d); a 
> small in-memory sketch follows after the list:
> * Compute the d-dimensional empirical mean vector m of the data points.
> * Center the data by subtracting m from each data point.
> * Compute the covariance matrix C = E' * E (up to a constant factor), where 
> E is the centered data matrix and E' is its transpose.
> * Compute the eigenvalue decomposition of C, such that e_i is the 
> eigenvector corresponding to eigenvalue lambda_i, ordered such that 
> lambda_i > lambda_j implies i < j.
> * Create the transformation matrix T from the p largest eigenvectors 
> (e_0, ..., e_{p-1}) of C.
> * Transform a data matrix into the PCA space: R = (D - M) * T, where M 
> repeats m in every row.
> * Transform data from the PCA space back to the original space: 
> D = (R * T') + M.
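>
> A minimal in-memory sketch of the steps above, using Apache Commons Math for 
> the eigendecomposition (illustrative only, not the proposed distributed 
> implementation; the class name and the toy data are made up):
> {code:java}
> import java.util.Arrays;
> import java.util.Comparator;
> import org.apache.commons.math3.linear.*;
>
> /** Illustrative in-memory PCA; not the proposed M/R implementation. */
> public class PcaSketch {
>   public static void main(String[] args) {
>     // Toy n x d data matrix D: n = 4 points, d = 3 dimensions.
>     RealMatrix D = new Array2DRowRealMatrix(new double[][] {
>         {2.5, 2.4, 0.5}, {0.5, 0.7, 1.2}, {2.2, 2.9, 0.3}, {1.9, 2.2, 0.8}});
>     int n = D.getRowDimension(), d = D.getColumnDimension(), p = 2;
>
>     // Empirical mean vector m.
>     RealVector m = new ArrayRealVector(d);
>     for (int i = 0; i < n; i++) m = m.add(D.getRowVector(i));
>     m.mapDivideToSelf(n);
>
>     // Center: E = D - M, where M repeats m in every row.
>     RealMatrix E = D.copy();
>     for (int i = 0; i < n; i++) E.setRowVector(i, E.getRowVector(i).subtract(m));
>
>     // C = E' * E; the 1/(n-1) covariance factor is omitted, since it
>     // rescales the eigenvalues but leaves the eigenvectors unchanged.
>     RealMatrix C = E.transpose().multiply(E);
>
>     // Eigendecomposition of C; sort indices by descending eigenvalue.
>     EigenDecomposition eig = new EigenDecomposition(C);
>     final double[] lambda = eig.getRealEigenvalues();
>     Integer[] order = new Integer[d];
>     for (int i = 0; i < d; i++) order[i] = i;
>     Arrays.sort(order, new Comparator<Integer>() {
>       public int compare(Integer a, Integer b) {
>         return Double.compare(lambda[b], lambda[a]);
>       }
>     });
>
>     // T = d x p matrix of the p leading eigenvectors.
>     RealMatrix T = new Array2DRowRealMatrix(d, p);
>     for (int j = 0; j < p; j++) T.setColumnVector(j, eig.getEigenvector(order[j]));
>
>     // Into PCA space: R = (D - M) * T; back again: D ~ (R * T') + M.
>     RealMatrix R = E.multiply(T);
>     RealMatrix Dback = R.multiply(T.transpose());
>     for (int i = 0; i < n; i++) Dback.setRowVector(i, Dback.getRowVector(i).add(m));
>
>     System.out.println("projected R =\n" + R);
>     System.out.println("reconstructed D =\n" + Dback);
>   }
> }
> {code}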
> h2.Implementation overview:
> I would suggest implementing PCA for the mahout-examples subproject. Some of 
> the jobs could perhaps also go into mahout-core/math, if desired. The 
> implementation would consist of a set of M/R jobs and some gluing code. I 
> assume the data is given in HDFS as a file of d-dimensional data points, 
> where each row corresponds to a data point. The map-reduce based 
> implementation of PCA for Mahout would then consist of the following code:
> # Distributed data centering code that computes the empirical mean vector 
> and centers the data.
> # Distributed code for computing the covariance matrix.
> # Sequential (?) glue code for computing the eigenvalue decomposition of 
> the covariance matrix and constructing the transformation matrices T and T'. 
> This code would use the existing eigensolvers of Mahout.
> # Distributed code to transform data into the PCA space. This code would use 
> the existing distributed matrix multiplication of Mahout and the data 
> centering job.
> # Distributed code to transform PCA data back into the original space. This 
> code would use the existing distributed matrix multiplication of Mahout and 
> the data centering job.
> # Glue code that combines all jobs into a coherent implementation of PCA 
> (see the driver sketch after this list).
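>
> Purely as an illustration of item 6, a pseudo-driver chaining the pieces; 
> every job class and method name below is hypothetical (nothing like this 
> exists in Mahout yet), so this only shows the intended control flow:
> {code:java}
> // Hypothetical glue code; all job classes below are placeholders.
> Configuration conf = new Configuration();
> Path data = new Path(args[0]);         // n x d input matrix on HDFS
> int p = Integer.parseInt(args[1]);     // nr of principal components
>
> Path mean      = MeanVectorJob.run(conf, data);                          // job 1
> Path centered  = CenterDataJob.run(conf, data, mean, true /*subtract*/); // job 2
> Path cov       = SelfProductJob.run(conf, centered);                     // job 3: C = E' * E
> Matrix T       = EigenGlue.leadingEigenvectors(conf, cov, p);            // sequential step
> Path projected = MatrixMultiplyJob.run(conf, centered, T);               // R = E * T
> {code}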
> h2.Implementation details:
> For a complete implementation, the following three map/reduce jobs have to 
> be written (sketches of jobs 1 and 3 follow after the list):
> # A M/R job to compute the empirical mean vector. Each mapper gets a row 
> (data point) and emits (dimension, [value, 1]) tuples. A local combiner 
> pre-aggregates the values for each dimension to (dimension, [sum of values, 
> nr of data points]). Finally, a (single) reducer computes the average for 
> each dimension and stores back the mean vector.
> # A M/R job to center (uncenter) the data. The job consists of a single 
> map phase that reads in a row (data point) of the original matrix, subtracts 
> (or adds) the empirical mean vector, and emits (row number, new row) pairs 
> that are written back to disk.
> # A M/R job to compute the matrix product of a large matrix with its 
> transpose. The principal idea: each mapper reads in a row of the matrix and 
> computes the products of all combinations (e.g. [a,b,c] --> 
> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position 
> within the symmetric result matrix. A local combiner sums up the local 
> products for each key, the reducer sums up the sums of products from the 
> mappers per key, and stores (position, value) tuples to disk. Finally, a 
> (local) clean-up phase constructs the covariance matrix from the (position, 
> value) tuples. Since the covariance matrix is d x d and d is typically small 
> compared to the number of data points, I assume this final local step should 
> be fine.
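>
> A sketch of job 1, assuming for simplicity that the input rows are 
> comma-separated text lines (a real job would presumably read Mahout vector 
> sequence files instead); all class names are illustrative:
> {code:java}
> import java.io.*;
> import org.apache.hadoop.io.*;
> import org.apache.hadoop.mapreduce.*;
>
> /** Map output value: (sum of values, nr of data points). */
> class SumCount implements Writable {
>   double sum; long count;
>   public void write(DataOutput out) throws IOException {
>     out.writeDouble(sum); out.writeLong(count);
>   }
>   public void readFields(DataInput in) throws IOException {
>     sum = in.readDouble(); count = in.readLong();
>   }
> }
>
> /** For each data row, emit (dimension, (value, 1)). */
> class MeanMapper extends Mapper<LongWritable, Text, IntWritable, SumCount> {
>   private final SumCount one = new SumCount();
>   @Override
>   protected void map(LongWritable offset, Text row, Context ctx)
>       throws IOException, InterruptedException {
>     String[] fields = row.toString().split(",");
>     for (int dim = 0; dim < fields.length; dim++) {
>       one.sum = Double.parseDouble(fields[dim]);
>       one.count = 1;
>       ctx.write(new IntWritable(dim), one);
>     }
>   }
> }
>
> /** Local combiner: pre-aggregate to (dimension, (sum, count)). */
> class MeanCombiner extends Reducer<IntWritable, SumCount, IntWritable, SumCount> {
>   @Override
>   protected void reduce(IntWritable dim, Iterable<SumCount> vals, Context ctx)
>       throws IOException, InterruptedException {
>     SumCount total = new SumCount();
>     for (SumCount sc : vals) { total.sum += sc.sum; total.count += sc.count; }
>     ctx.write(dim, total);
>   }
> }
>
> /** Single reducer: divide to obtain the mean of each dimension. */
> class MeanReducer extends Reducer<IntWritable, SumCount, IntWritable, DoubleWritable> {
>   @Override
>   protected void reduce(IntWritable dim, Iterable<SumCount> vals, Context ctx)
>       throws IOException, InterruptedException {
>     double sum = 0; long n = 0;
>     for (SumCount sc : vals) { sum += sc.sum; n += sc.count; }
>     ctx.write(dim, new DoubleWritable(sum / n));
>   }
> }
> {code}
> And the mapper of job 3, which emits the upper-triangle products keyed by 
> their linear position in the symmetric result (a summing combiner/reducer 
> of the usual kind completes the job):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.io.*;
> import org.apache.hadoop.mapreduce.*;
>
> /** For a row [a,b,c], emit (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc). */
> class SelfProductMapper
>     extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
>   @Override
>   protected void map(LongWritable offset, Text row, Context ctx)
>       throws IOException, InterruptedException {
>     String[] fields = row.toString().split(",");
>     double[] x = new double[fields.length];
>     for (int i = 0; i < x.length; i++) x[i] = Double.parseDouble(fields[i]);
>     long pos = 0;  // linear index into the upper triangle of the d x d result
>     for (int i = 0; i < x.length; i++) {
>       for (int j = i; j < x.length; j++) {
>         ctx.write(new LongWritable(pos++), new DoubleWritable(x[i] * x[j]));
>       }
>     }
>   }
> }
> {code}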


        
