Thanks Ted. I did notice some discussion in MAHOUT-817 on this, but I thought 
it might be worth providing a patch for 512 anyway.

Do you think there is value in making this a straightforward feature of the 
SSVD code?  If so, I might start working on MAHOUT-817.

On 26 Nov, 2011, at 11:20 PM, Ted Dunning (Commented) (JIRA) wrote:

> 
>    [ https://issues.apache.org/jira/browse/MAHOUT-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157689#comment-13157689 ]
> 
> Ted Dunning commented on MAHOUT-512:
> ------------------------------------
> 
> Raphael,
> 
> No.  The magic subtraction trick is not supported yet.  If you look at the 
> SSVD algorithm for instance,
> you can do the subtraction as part of the original projection very easily.  
> Then in the reprojection step,
> you do the same thing.  The result is a PCA algorithm instead of an SVD, but 
> you have to change the plumbing
> a fair bit to get there.
> 
>> Principal Component Analysis for mahout-examples
>> ------------------------------------------------
>> 
>>                Key: MAHOUT-512
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-512
>>            Project: Mahout
>>         Issue Type: New Feature
>>           Reporter: Max Heimel
>>           Assignee: Max Heimel
>>           Priority: Minor
>>        Attachments: MAHOUT-512.patch
>> 
>>  Original Estimate: 2,016h
>> Remaining Estimate: 2,016h
>> 
>> h2.Overview:
>> Principal Component Analysis is a widely used statistical method for 
>> decorrelation and dimensionality reduction. It is typically applied for 
>> denoising data, for data compression, and as a preprocessing step for more 
>> advanced statistical analysis tools (e.g. independent component analysis). 
>> The main idea of PCA is to apply a linear transformation that maps the 
>> given data points into a (typically lower-dimensional) feature space, such 
>> that the direction of greatest variance in the data lies along the first 
>> dimension of the feature space.
>> One approach to performing PCA is to transform the d-dimensional data 
>> into the space spanned by the p largest eigenvectors of the covariance 
>> matrix of the (centered) data. This is done in the following steps (assume 
>> the data is given as an n x d data matrix containing n data rows of 
>> dimensionality d):
>> * Compute the d-dimensional empirical mean vector m of the data points.
>> * Center the data by subtracting m from each data point.
>> * Compute the covariance matrix C = E' * E, where E is the centered data 
>> matrix and E' is its transpose. (The 1/(n-1) normalization is omitted 
>> here; it only scales the eigenvalues and does not change the eigenvectors.)
>> * Compute the eigenvalue decomposition of C, such that e_i is the 
>> eigenvector corresponding to eigenvalue lambda_i, ordered so that 
>> lambda_i > lambda_j implies i < j.
>> * Create the transformation matrix T from the p largest eigenvectors 
>> (e_1 ... e_p) of C.
>> * Transforming a data matrix into the PCA space: R = (D - M) * T, where M 
>> stacks the mean vector m in every row.
>> * Transforming data from PCA back to the original space: D = (R * T') + M; 
>> this is only an approximation when p < d. A small in-memory sketch of both 
>> transforms follows this list.
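>> 
>> The following is just a minimal, illustrative in-memory sketch of the two 
>> transforms (plain arrays, made-up names); the distributed version is what 
>> the M/R jobs below would provide:
>> {code:java}
>> // In-memory sketch only; the real implementation would be the M/R jobs
>> // described below.
>> public class PcaTransforms {
>> 
>>   // R = (D - M) * T, where M stacks the mean vector m in every row.
>>   public static double[][] toPcaSpace(double[][] D, double[] m, double[][] T) {
>>     int n = D.length, d = m.length, p = T[0].length;
>>     double[][] R = new double[n][p];
>>     for (int i = 0; i < n; i++)
>>       for (int j = 0; j < p; j++)
>>         for (int k = 0; k < d; k++)
>>           R[i][j] += (D[i][k] - m[k]) * T[k][j];
>>     return R;
>>   }
>> 
>>   // D ~= (R * T') + M; exact only when p = d.
>>   public static double[][] fromPcaSpace(double[][] R, double[] m, double[][] T) {
>>     int n = R.length, d = m.length, p = T[0].length;
>>     double[][] D = new double[n][d];
>>     for (int i = 0; i < n; i++)
>>       for (int k = 0; k < d; k++) {
>>         D[i][k] = m[k];
>>         for (int j = 0; j < p; j++)
>>           D[i][k] += R[i][j] * T[k][j];   // T'[j][k] == T[k][j]
>>       }
>>     return D;
>>   }
>> }
>> {code}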
>> h2.Implementation overview:
>> I would suggest implementing PCA in the mahout-examples subproject. Some 
>> of the jobs could also go into mahout-core/math, if desired. The 
>> implementation would consist of a set of M/R jobs and some glue code. I 
>> assume the data is given in HDFS as a file of d-dimensional data points, 
>> where each row corresponds to one data point. The map-reduce based 
>> implementation of PCA for Mahout would then consist of the following code:
>> # Distributed data centering code that computes the empirical mean vector 
>> and centers the data.
>> # Distributed code for computing the covariance matrix.
>> # Sequential (?) glue code for computing the eigenvalue decomposition of 
>> the covariance matrix and constructing the transformation matrices T and 
>> T'. This code would use the existing eigensolvers of Mahout; a sketch of 
>> this glue step follows the list.
>> # Distributed code to transform data into the PCA space. This code would 
>> use the existing distributed matrix multiplication of Mahout & the data 
>> centering job.
>> # Distributed code to transform PCA data back into the original space. 
>> This code would use the existing distributed matrix multiplication of 
>> Mahout & the data centering job.
>> # Glue code that combines all jobs into a coherent implementation of PCA.
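>> 
>> As a sketch of the sequential glue in step 3 (a hypothetical helper; it 
>> assumes the eigensolver returns the eigenvector for lambda[i] as column i 
>> of eigVecs, and T' is then simply the transpose of T):
>> {code:java}
>> import java.util.Arrays;
>> import java.util.Comparator;
>> 
>> public class TransformBuilder {
>>   // Sort the eigenpairs by decreasing eigenvalue and keep the p leading
>>   // eigenvectors as the columns of T.
>>   public static double[][] buildT(final double[] lambda,
>>                                   double[][] eigVecs, int p) {
>>     int d = lambda.length;
>>     Integer[] order = new Integer[d];
>>     for (int i = 0; i < d; i++) order[i] = i;
>>     Arrays.sort(order, new Comparator<Integer>() {
>>       public int compare(Integer a, Integer b) {
>>         return Double.compare(lambda[b], lambda[a]);  // descending
>>       }
>>     });
>>     double[][] T = new double[d][p];
>>     for (int j = 0; j < p; j++)
>>       for (int i = 0; i < d; i++)
>>         T[i][j] = eigVecs[i][order[j]];
>>     return T;
>>   }
>> }
>> {code}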
>> h2.Implementation details:
>> For a complete implementation, the following three map/reduce jobs have to 
>> be written:
>> # A M/R job to compute the empirical mean vector. Each mapper gets a row 
>> (data point) and emits (dimension, [value, 1]) tuples. A local combiner 
>> pre-aggregates the values for each dimension to (dimension, [sum of values, 
>> nr of data points]). Finally, a (single) reducer computes the average for 
>> each dimension and writes the mean vector back. (See the first sketch 
>> after this list.)
>> # A M/R job to center (or uncenter) the data. The job consists of a single 
>> map phase that reads in a row (data point) of the original matrix, subtracts 
>> (or adds) the empirical mean vector, and emits (row number, new row) pairs 
>> that are written back to disk. (See the second sketch after this list.)
>> # A M/R job to compute the matrix product of a large matrix with its 
>> transpose. The basic idea: each mapper reads in a row of the matrix and 
>> computes the products of all combinations (e.g. [a,b,c] --> 
>> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position 
>> within the symmetric result matrix. A local combiner sums up the local 
>> products for each key, the reducer sums the partial sums per key and 
>> stores (position, value) tuples to disk. Finally, a (local) clean-up phase 
>> constructs the covariance matrix from the (position, value) tuples. Since 
>> the covariance matrix is d x d and d is typically small compared to the 
>> number of data points, I assume the final local step should be fine. (See 
>> the third sketch after this list.)
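>> 
>> A rough Hadoop sketch of job 1. The text encoding of the [sum, count] pair 
>> is just for illustration, not an existing Mahout format:
>> {code:java}
>> import java.io.IOException;
>> 
>> import org.apache.hadoop.io.DoubleWritable;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>> import org.apache.hadoop.mapreduce.Reducer;
>> 
>> public class MeanVectorJob {
>> 
>>   // Mapper: one input row -> (dimension, "value,1") per dimension.
>>   public static class MeanMapper
>>       extends Mapper<LongWritable, Text, IntWritable, Text> {
>>     @Override
>>     protected void map(LongWritable offset, Text row, Context ctx)
>>         throws IOException, InterruptedException {
>>       String[] cols = row.toString().trim().split("\\s+");
>>       for (int d = 0; d < cols.length; d++) {
>>         ctx.write(new IntWritable(d), new Text(cols[d] + ",1"));
>>       }
>>     }
>>   }
>> 
>>   // Combiner: pre-aggregates to (dimension, "sum,count") per map task.
>>   public static class MeanCombiner
>>       extends Reducer<IntWritable, Text, IntWritable, Text> {
>>     @Override
>>     protected void reduce(IntWritable dim, Iterable<Text> vals, Context ctx)
>>         throws IOException, InterruptedException {
>>       double sum = 0;
>>       long count = 0;
>>       for (Text v : vals) {
>>         String[] p = v.toString().split(",");
>>         sum += Double.parseDouble(p[0]);
>>         count += Long.parseLong(p[1]);
>>       }
>>       ctx.write(dim, new Text(sum + "," + count));
>>     }
>>   }
>> 
>>   // Single reducer: emits (dimension, mean); the rows together form m.
>>   public static class MeanReducer
>>       extends Reducer<IntWritable, Text, IntWritable, DoubleWritable> {
>>     @Override
>>     protected void reduce(IntWritable dim, Iterable<Text> vals, Context ctx)
>>         throws IOException, InterruptedException {
>>       double sum = 0;
>>       long count = 0;
>>       for (Text v : vals) {
>>         String[] p = v.toString().split(",");
>>         sum += Double.parseDouble(p[0]);
>>         count += Long.parseLong(p[1]);
>>       }
>>       ctx.write(dim, new DoubleWritable(sum / count));
>>     }
>>   }
>> }
>> {code}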
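>> 
>> A sketch of job 2's map-only pass. The config keys are hypothetical, and 
>> with TextInputFormat the byte offset stands in for the row number:
>> {code:java}
>> import java.io.IOException;
>> 
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>> 
>> public class CenterMapper
>>     extends Mapper<LongWritable, Text, LongWritable, Text> {
>> 
>>   private double[] mean;
>>   private int sign;   // -1 to center, +1 to uncenter
>> 
>>   @Override
>>   protected void setup(Context ctx) {
>>     // Hypothetical config keys, not existing Mahout options.
>>     String[] p = ctx.getConfiguration().get("pca.mean.vector").split(",");
>>     mean = new double[p.length];
>>     for (int i = 0; i < p.length; i++) mean[i] = Double.parseDouble(p[i]);
>>     sign = ctx.getConfiguration().getBoolean("pca.uncenter", false) ? 1 : -1;
>>   }
>> 
>>   @Override
>>   protected void map(LongWritable offset, Text row, Context ctx)
>>       throws IOException, InterruptedException {
>>     String[] cols = row.toString().trim().split("\\s+");
>>     StringBuilder out = new StringBuilder();
>>     for (int d = 0; d < cols.length; d++) {
>>       if (d > 0) out.append(' ');
>>       out.append(Double.parseDouble(cols[d]) + sign * mean[d]);
>>     }
>>     ctx.write(offset, new Text(out.toString()));   // (row number, new row)
>>   }
>> }
>> {code}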
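>> 
>> And a sketch of job 3's mapper plus the shared combiner/reducer; the keys 
>> index the upper triangle of the d x d result, matching the example above:
>> {code:java}
>> import java.io.IOException;
>> 
>> import org.apache.hadoop.io.DoubleWritable;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>> import org.apache.hadoop.mapreduce.Reducer;
>> 
>> public class CovarianceJob {
>> 
>>   // Mapper: one centered row [x_0 .. x_{d-1}] -> x_i * x_j for all i <= j.
>>   public static class OuterProductMapper
>>       extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
>>     @Override
>>     protected void map(LongWritable offset, Text row, Context ctx)
>>         throws IOException, InterruptedException {
>>       String[] cols = row.toString().trim().split("\\s+");
>>       int d = cols.length;
>>       double[] x = new double[d];
>>       for (int i = 0; i < d; i++) x[i] = Double.parseDouble(cols[i]);
>>       for (int i = 0; i < d; i++) {
>>         for (int j = i; j < d; j++) {
>>           // Upper-triangle index: for d=3 this gives aa=0, ab=1, ..., cc=5.
>>           int pos = i * d - i * (i + 1) / 2 + j;
>>           ctx.write(new IntWritable(pos), new DoubleWritable(x[i] * x[j]));
>>         }
>>       }
>>     }
>>   }
>> 
>>   // Summation is associative, so one class serves as both the local
>>   // combiner and the reducer; the (position, value) output is assembled
>>   // into C by the local clean-up step.
>>   public static class SumReducer
>>       extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
>>     @Override
>>     protected void reduce(IntWritable pos, Iterable<DoubleWritable> vals,
>>         Context ctx) throws IOException, InterruptedException {
>>       double sum = 0;
>>       for (DoubleWritable v : vals) sum += v.get();
>>       ctx.write(pos, new DoubleWritable(sum));
>>     }
>>   }
>> }
>> {code}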
