Hi Steve,

yes, you are obviously right. The computation and eigen-decomposition
of the covariance matrix can be replaced by SVD. Thanks for the hint,
shame on me :)
I will adapt the proposal in JIRA accordingly.

Thanks
Max

On Sat, Sep 25, 2010 at 7:23 PM, Steve Lianoglou
<[email protected]> wrote:
> Hi,
>
> I'm just a casual lurker and am not too well versed in the mahout
> universe, so please forgive the intrusion, but doesn't mahout already
> have an SVD implementation?
>
> If so, then you've already got the lion's share of a PCA
> implementation right there, no? Here's a nice/accessible overview:
>
> Singular value decomposition and principal component analysis
> http://public.lanl.gov/mewall/kluwer2002.html
>
> See the section entitled "Relation to principal component analysis"
> for an explanation of how the right/left singular vectors are related
> to the eigenvectors of your cov(X) matrix (which is pretty much the
> central "nut" for PCA).
>
> -steve
>
> On Sat, Sep 25, 2010 at 9:50 AM, Max Heimel (JIRA) <[email protected]> wrote:
>> Principal Component Analysis for mahout-examples
>> ------------------------------------------------
>>
>>                 Key: MAHOUT-512
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-512
>>             Project: Mahout
>>          Issue Type: New Feature
>>            Reporter: Max Heimel
>>            Assignee: Max Heimel
>>            Priority: Minor
>>
>>
>> h2.Overview:
>> Principle Component Analysis is a widely used statistical method for 
>> decorrelation and dimension-reduction. It is typically applied for denoising 
>> data, data compression and as a preprocessing step for more advanced 
>> statistical analysis tools (like e.g. independent component analysis). The 
>> main idea of PCA is to apply a linear transformation of the given data 
>> points into a - typically less-dimensional - feature space, such that the 
>> direction of greatest variance within the data lies on the first dimension 
>> within the feature space.
>> One approach to performing PCA is by transforming the d-dimensional data 
>> into the space spanned by the p largest eigenvectors of the covariance 
>> matrix of the (centered) data. This is done in the following steps (assume 
>> the data is given by a nxd data matrix containing n data rows with 
>> dimensionality d):
>> * Compute the d-dimensional empirical mean-vector m of the data points
>> * Center the data by subtracting m from each data point.
>> * Compute the covariance matrix C = E' * E, where E is the centered data 
>> matrix and E' is its transpose.
>> * Compute the eigenvalue decomposition of C, such that e_i is the 
>> eigenvector corresponding to eigenvalue lambda_i, order i such that lambda_i 
>> > lambda_j implies i<j.
>> * Create the transformation matrix T as the p largest eigenvectors (e_0 ... 
>> e_p) of C.
>> * Transforming a data matrix into the PCA space: R=(D-M)*T
>> * Transforming data from PCA to original space: D=(R*T')+M
>>
>> h2.Implementation overview:
>> I would suggest implementing PCA for the mahout-examples subproject. Maybe 
>> some of the jobs could also go into mahout-core/math, if wished. The 
>> implementation would consist of a set of M/R jobs and some gluing code. I 
>> assume data is given in HDFS as a file of d-dimensional data points, where 
>> each row corresponds to a data point. The map-reduce based implementation of 
>> PCA for Mahout would then consist of the following code:
>> # Distributed Data centering code, that computes the emprical mean vector 
>> and centers the data
>> # Distributed Code for computing the covariance matrix.
>> # Sequential ( ? )  glue code for computing the eigenvalue decomposition of 
>> the covariance matrix and constructing the transformation matrices T and T'. 
>> This code would use the existing eigensolvers of Mahout.
>> # Distributed code to transform data into the PCA space. This code would use 
>> the existing distributed matrix multiplication of mahout & the data 
>> centering job
>> # Distributed code to transforms PCA  data back into the original space. 
>> This code would use the existing distributed matrix multiplication of mahout 
>> & the data centering job.
>> # Glue code that combines all jobs into a coherent implementation of PCA.
>>
>> h2.Implementation details:
>> For a complete implementation the following three map/reduce jobs have to be 
>> written:
>> # A M/R job to compute the empirical mean vector. Each row gets a row (data 
>> point) and emits (dimension, [value,1]) tuples. A local combiner 
>> preaggregates the values for each dimension to (dimension, [sum of values, 
>> nr of datapoints]). Finally, a (single) reducer computes the average for 
>> each dimension and stores back the mean vector.
>> # A M/R job to center (uncenter) the data. The job consists of a single 
>> map-phase that reads in a row (data point) of the original matrix, subtracts 
>> (or adds) the empirical mean vector and emits (row number, new ro) pairs 
>> that are written back to disk.
>> # A M/R job to compute the matrix product of a large matrix with its 
>> transpose. Principle idea: each mapper reads in a row of the matrix and 
>> computes the products of all combinations (e.g. [a,b,c] --> 
>> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position 
>> within the symmetric result matrix. A local combiner sums up the local 
>> products for each key, the reducer sums up the sum of products for each 
>> mapper per key and stores (position, value) tuples to disk. Finally a 
>> (local) clean-up phase constructs the covariance matrix from the (position, 
>> value) tuples. Since the covariance matrix is dxd and d is typically low 
>> compared to the number of data points, I assume the final local step should 
>> be fine.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>

Reply via email to