[jira] [Commented] (MAHOUT-512) Principal Component Analysis for mahout-examples

Raphael Cendrillon (Commented) (JIRA) Sat, 26 Nov 2011 21:03:06 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157657#comment-13157657
 ]


Raphael Cendrillon commented on MAHOUT-512:
-------------------------------------------

A few comments might be in order:

 - SVD is currently done sequentially. This means that the code is scalable 
with size of training data but not so much for feature size. I'd like to hook 
this into SSVD to make the feature size scalable as well.
 - There is a fair ammount of code duplication between PcaMatrix and 
DistributedRowMatrix. This has been done to minimize impact on existing code. 
If MatrixCovarianceJob, MatrixRowAverageJob, MatrixVectorAdditionJob etc. are 
found to be useful in other places then it might be worth integrating these 
jobs directly into DistributedRowMatrix
 - Projection into feature space and reconstruction using principal components 
is currently done with TransposeJob and MatrixMultiplicationJob. It may be 
worth implementing these as seperate M/R jobs for higher efficiency
 - The is some code replication between some of the jobs. It might be 
worthwhile to push the common code into an abstract class
 - Currently MatrixVectorAdditionJob uses an outer join to subtract the vector 
from each row of the matrix. There are probably more efficient ways to do this. 
                
> Principal Component Analysis for mahout-examples
> ------------------------------------------------
>
>                 Key: MAHOUT-512
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-512
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Max Heimel
>            Assignee: Max Heimel
>            Priority: Minor
>         Attachments: MAHOUT-512.patch
>
>   Original Estimate: 2,016h
>  Remaining Estimate: 2,016h
>
> h2.Overview:
> Principle Component Analysis is a widely used statistical method for 
> decorrelation and dimension-reduction. It is typically applied for denoising 
> data, data compression and as a preprocessing step for more advanced 
> statistical analysis tools (like e.g. independent component analysis). The 
> main idea of PCA is to apply a linear transformation of the given data points 
> into a - typically less-dimensional - feature space, such that the direction 
> of greatest variance within the data lies on the first dimension within the 
> feature space.
> One approach to performing PCA is by transforming the d-dimensional data into 
> the space spanned by the p largest eigenvectors of the covariance matrix of 
> the (centered) data. This is done in the following steps (assume the data is 
> given by a nxd data matrix containing n data rows with dimensionality d):
> * Compute the d-dimensional empirical mean-vector m of the data points
> * Center the data by subtracting m from each data point.
> * Compute the covariance matrix C = E' * E, where E is the centered data 
> matrix and E' is its transpose.
> * Compute the eigenvalue decomposition of C, such that e_i is the eigenvector 
> corresponding to eigenvalue lambda_i, order i such that lambda_i > lambda_j 
> implies i<j.
> * Create the transformation matrix T as the p largest eigenvectors (e_0 ... 
> e_p) of C.
> * Transforming a data matrix into the PCA space: R=(D-M)*T
> * Transforming data from PCA to original space: D=(R*T')+M 
> h2.Implementation overview:
> I would suggest implementing PCA for the mahout-examples subproject. Maybe 
> some of the jobs could also go into mahout-core/math, if wished. The 
> implementation would consist of a set of M/R jobs and some gluing code. I 
> assume data is given in HDFS as a file of d-dimensional data points, where 
> each row corresponds to a data point. The map-reduce based implementation of 
> PCA for Mahout would then consist of the following code:
> # Distributed Data centering code, that computes the emprical mean vector and 
> centers the data
> # Distributed Code for computing the covariance matrix.
> # Sequential ( ? )  glue code for computing the eigenvalue decomposition of 
> the covariance matrix and constructing the transformation matrices T and T'. 
> This code would use the existing eigensolvers of Mahout. 
> # Distributed code to transform data into the PCA space. This code would use 
> the existing distributed matrix multiplication of mahout & the data centering 
> job
> # Distributed code to transforms PCA  data back into the original space. This 
> code would use the existing distributed matrix multiplication of mahout & the 
> data centering job.
> # Glue code that combines all jobs into a coherent implementation of PCA.
> h2.Implementation details:
> For a complete implementation the following three map/reduce jobs have to be 
> written:
> # A M/R job to compute the empirical mean vector. Each row gets a row (data 
> point) and emits (dimension, [value,1]) tuples. A local combiner 
> preaggregates the values for each dimension to (dimension, [sum of values, nr 
> of datapoints]). Finally, a (single) reducer computes the average for each 
> dimension and stores back the mean vector.
> # A M/R job to center (uncenter) the data. The job consists of a single 
> map-phase that reads in a row (data point) of the original matrix, subtracts 
> (or adds) the empirical mean vector and emits (row number, new ro) pairs that 
> are written back to disk.
> # A M/R job to compute the matrix product of a large matrix with its 
> transpose. Principle idea: each mapper reads in a row of the matrix and 
> computes the products of all combinations (e.g. [a,b,c] --> 
> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position 
> within the symmetric result matrix. A local combiner sums up the local 
> products for each key, the reducer sums up the sum of products for each 
> mapper per key and stores (position, value) tuples to disk. Finally a (local) 
> clean-up phase constructs the covariance matrix from the (position, value) 
> tuples. Since the covariance matrix is dxd and d is typically low compared to 
> the number of data points, I assume the final local step should be fine.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-512) Principal Component Analysis for mahout-examples

Reply via email to