Yes. I think M_b can be calculated efficiently as qsum' * a_mean, where qsum is the vector obtained by summing the rows of Q and a_mean is the mean row of A.
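
Just to spell out the bookkeeping (a sketch only, with the shapes assumed as follows: A is m x n with one data point per row, a_mean is its 1 x n mean row, Q is m x k, B = Q' A, and qsum = 1' Q is the 1 x k sum of Q's rows):

    A_mean = 1 * a_mean                    (every row of A_mean is a_mean)
    M_b    = Q' * A_mean
           = (Q' * 1) * a_mean
           = qsum' * a_mean

So M_b is a rank-1 outer product with entries M_b[i][j] = qsum[i] * a_mean[j]; it never has to be materialized or stored, and its entries can be formed on the fly from the k values of qsum and the n values of a_mean while accumulating (B-M_b)(B-M_b)'.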
Just to clarify: the data points are stored in the rows of A, so we would like to calculate the row average, correct? I think there is some confusion on this.

On Nov 27, 2011, at 7:40 PM, "Ted Dunning (Commented) (JIRA)" <[email protected]> wrote:

> 
> [ https://issues.apache.org/jira/browse/MAHOUT-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158164#comment-13158164 ]
> 
> Ted Dunning commented on MAHOUT-512:
> ------------------------------------
> 
> {quote}
> I think it is much easier to compute and subtract the B mean during computation of BB' in flight, and produce (B-M_b)(B-M_b)' instead.
> {quote}
> As long as M_b = Q' A_mean, then I think that this is good. Since this will have rank 1, you can compute it very efficiently on the fly and not store everything.
> 
> 
>> Principal Component Analysis for mahout-examples
>> ------------------------------------------------
>> 
>> Key: MAHOUT-512
>> URL: https://issues.apache.org/jira/browse/MAHOUT-512
>> Project: Mahout
>> Issue Type: New Feature
>> Reporter: Max Heimel
>> Assignee: Max Heimel
>> Priority: Minor
>> Attachments: MAHOUT-512.patch
>> 
>> Original Estimate: 2,016h
>> Remaining Estimate: 2,016h
>> 
>> h2.Overview:
>> Principal Component Analysis (PCA) is a widely used statistical method for decorrelation and dimension reduction. It is typically applied for denoising data, for data compression, and as a preprocessing step for more advanced statistical analysis tools (e.g. independent component analysis). The main idea of PCA is to apply a linear transformation that maps the given data points into a - typically lower-dimensional - feature space, such that the direction of greatest variance in the data lies along the first dimension of the feature space.
>> One approach to performing PCA is to transform the d-dimensional data into the space spanned by the p largest eigenvectors of the covariance matrix of the (centered) data. This is done in the following steps (assume the data is given by an n x d data matrix containing n data rows of dimensionality d):
>> * Compute the d-dimensional empirical mean vector m of the data points.
>> * Center the data by subtracting m from each data point.
>> * Compute the covariance matrix C = E' * E, where E is the centered data matrix and E' is its transpose.
>> * Compute the eigenvalue decomposition of C, such that e_i is the eigenvector corresponding to eigenvalue lambda_i, ordered such that lambda_i > lambda_j implies i < j.
>> * Create the transformation matrix T from the p largest eigenvectors (e_0 ... e_p) of C.
>> * Transforming a data matrix into the PCA space: R = (D - M) * T
>> * Transforming data from PCA space back to the original space: D = (R * T') + M
>> h2.Implementation overview:
>> I would suggest implementing PCA for the mahout-examples subproject. Maybe some of the jobs could also go into mahout-core/math, if desired. The implementation would consist of a set of M/R jobs and some gluing code. I assume the data is given in HDFS as a file of d-dimensional data points, where each row corresponds to a data point. The map-reduce based implementation of PCA for Mahout would then consist of the following code:
>> # Distributed data-centering code that computes the empirical mean vector and centers the data.
>> # Distributed code for computing the covariance matrix.
>> # Sequential ( ? ) glue code for computing the eigenvalue decomposition of the covariance matrix and constructing the transformation matrices T and T'.
>> This code would use the existing eigensolvers of Mahout.
>> # Distributed code to transform data into the PCA space. This code would use the existing distributed matrix multiplication of Mahout and the data centering job.
>> # Distributed code to transform PCA data back into the original space. This code would use the existing distributed matrix multiplication of Mahout and the data centering job.
>> # Glue code that combines all jobs into a coherent implementation of PCA.
>> h2.Implementation details:
>> For a complete implementation, the following three map/reduce jobs have to be written:
>> # An M/R job to compute the empirical mean vector. Each mapper gets a row (data point) and emits (dimension, [value, 1]) tuples. A local combiner pre-aggregates the values for each dimension to (dimension, [sum of values, number of data points]). Finally, a (single) reducer computes the average for each dimension and stores back the mean vector.
>> # An M/R job to center (uncenter) the data. The job consists of a single map phase that reads in a row (data point) of the original matrix, subtracts (or adds) the empirical mean vector, and emits (row number, new row) pairs that are written back to disk.
>> # An M/R job to compute the matrix product of a large matrix with its transpose. Basic idea: each mapper reads in a row of the matrix and computes the products of all combinations (e.g. [a,b,c] --> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position within the symmetric result matrix. A local combiner sums up the local products for each key; the reducer sums up these partial sums per key and stores (position, value) tuples to disk. Finally, a (local) clean-up phase constructs the covariance matrix from the (position, value) tuples. Since the covariance matrix is d x d and d is typically small compared to the number of data points, I assume this final local step should be fine.
> 
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
> 
> 
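
In case it helps to make the row-products job (the last M/R job described above) concrete, here is a rough sketch. None of this is Mahout code: the class names are made up, and the input format (one whitespace-separated, already centered row per line) and the key encoding (linear index into the upper triangle of the d x d result) are my own assumptions for illustration.

{code}
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Illustrative sketch of the "row products" job described above.
 * Input: one already-centered data row per line, values separated by
 * whitespace. For a row e the mapper emits e_i * e_j for all i <= j,
 * keyed by the linear position of (i, j) in the upper triangle of the
 * d x d result; the combiner/reducer simply sum per key.
 */
public class RowOuterProductSketch {

  /** Linear index of entry (i, j), i <= j, in the upper triangle of a d x d matrix. */
  static long upperTriangleIndex(int i, int j, int d) {
    return (long) i * d - (long) i * (i - 1) / 2 + (j - i);
  }

  public static class RowProductMapper
      extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

    private final LongWritable position = new LongWritable();
    private final DoubleWritable product = new DoubleWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().trim().split("\\s+");
      double[] row = new double[fields.length];
      for (int i = 0; i < fields.length; i++) {
        row[i] = Double.parseDouble(fields[i]);
      }
      int d = row.length;
      // Emit the product for every (i, j) with i <= j; the lower triangle
      // follows by symmetry of E' * E.
      for (int i = 0; i < d; i++) {
        for (int j = i; j < d; j++) {
          position.set(upperTriangleIndex(i, j, d));
          product.set(row[i] * row[j]);
          context.write(position, product);
        }
      }
    }
  }

  /** Usable both as combiner and as reducer, since it only sums values per key. */
  public static class SumReducer
      extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {

    private final DoubleWritable sum = new DoubleWritable();

    @Override
    protected void reduce(LongWritable position, Iterable<DoubleWritable> values,
        Context context) throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable value : values) {
        total += value.get();
      }
      sum.set(total);
      // (position, value) pairs for the upper triangle of E' * E; a small
      // local clean-up step can assemble the full d x d covariance matrix.
      context.write(position, sum);
    }
  }
}
{code}

For d = 3 the emitted keys line up with the (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc) numbering used in the example above.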
