Principal Component Analysis for mahout-examples
------------------------------------------------

                 Key: MAHOUT-512
                 URL: https://issues.apache.org/jira/browse/MAHOUT-512
             Project: Mahout
          Issue Type: New Feature
            Reporter: Max Heimel
            Assignee: Max Heimel
            Priority: Minor


h2.Overview:
Principal Component Analysis (PCA) is a widely used statistical method for 
decorrelation and dimension reduction. It is typically applied for denoising 
data, for data compression, and as a preprocessing step for more advanced 
statistical analysis tools such as independent component analysis. The main 
idea of PCA is to apply a linear transformation that maps the given data points into a 
- typically lower-dimensional - feature space, such that the direction of 
greatest variance in the data lies along the first dimension of the 
feature space.
One approach to performing PCA is to transform the d-dimensional data into 
the space spanned by the p largest eigenvectors of the covariance matrix of the 
(centered) data. This is done in the following steps (assume the data is given 
as an n x d data matrix D containing n data rows of dimensionality d); a small 
in-memory Java sketch of these steps is given after the list:
* Compute the d-dimensional empirical mean vector m of the data points.
* Center the data by subtracting m from each data point.
* Compute the covariance matrix C = E' * E (up to the usual 1/(n-1) normalization), where E is the centered data matrix and E' is its transpose.
* Compute the eigenvalue decomposition of C, such that e_i is the eigenvector corresponding to eigenvalue lambda_i; order the eigenvectors by decreasing eigenvalue, i.e. lambda_i > lambda_j implies i < j.
* Create the transformation matrix T from the p largest eigenvectors (e_0 ... e_{p-1}) of C, taken as columns.
* Transform a data matrix into the PCA space: R = (D - M) * T, where M stacks the mean vector m into every row.
* Transform data from the PCA space back into the original space: D = (R * T') + M
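
To make these steps concrete, here is a minimal, purely in-memory Java sketch (not part of the proposed M/R code). The class and method names are made up for illustration, and the eigenvalue decomposition is left out because it would be delegated to an existing solver; the transformation matrix T is simply assumed to be given.

{code:java}
// Minimal in-memory sketch of the PCA steps above (illustrative only).
// data: n x d matrix, one data point per row.
// t:    d x p matrix whose columns are the p largest eigenvectors of the
//       covariance matrix (assumed to be precomputed by an eigensolver).
public class PcaSketch {

  /** Compute the d-dimensional empirical mean vector m of the data points. */
  static double[] mean(double[][] data) {
    int n = data.length, d = data[0].length;
    double[] m = new double[d];
    for (double[] row : data) {
      for (int j = 0; j < d; j++) {
        m[j] += row[j] / n;
      }
    }
    return m;
  }

  /** Center the data by subtracting m from each data point: E = D - M. */
  static double[][] center(double[][] data, double[] m) {
    double[][] e = new double[data.length][m.length];
    for (int i = 0; i < data.length; i++) {
      for (int j = 0; j < m.length; j++) {
        e[i][j] = data[i][j] - m[j];
      }
    }
    return e;
  }

  /** Covariance matrix C = E' * E (up to the usual 1/(n-1) scaling). */
  static double[][] covariance(double[][] e) {
    int d = e[0].length;
    double[][] c = new double[d][d];
    for (double[] row : e) {
      for (int j = 0; j < d; j++) {
        for (int k = 0; k < d; k++) {
          c[j][k] += row[j] * row[k];
        }
      }
    }
    return c;
  }

  /** Transform the centered data into the PCA space: R = (D - M) * T. */
  static double[][] project(double[][] e, double[][] t) {
    int p = t[0].length;
    double[][] r = new double[e.length][p];
    for (int i = 0; i < e.length; i++) {
      for (int k = 0; k < p; k++) {
        for (int j = 0; j < t.length; j++) {
          r[i][k] += e[i][j] * t[j][k];
        }
      }
    }
    return r;
  }
}
{code}

In the distributed setting, each of these steps maps onto one of the M/R jobs described below.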

h2.Implementation overview:
I would suggest implementing PCA for the mahout-examples subproject. Maybe some 
of the jobs could also go into mahout-core/math, if desired. The implementation 
would consist of a set of M/R jobs and some glue code. I assume the data is given 
in HDFS as a file of d-dimensional data points, where each row corresponds to one 
data point. The map/reduce-based implementation of PCA for Mahout would then 
consist of the following code:
# Distributed data-centering code that computes the empirical mean vector and centers the data.
# Distributed code for computing the covariance matrix.
# Sequential (?) glue code for computing the eigenvalue decomposition of the covariance matrix and constructing the transformation matrices T and T'. This code would use the existing eigensolvers of Mahout; a small sketch of this step is given after the list.
# Distributed code to transform data into the PCA space. This code would use the existing distributed matrix multiplication of Mahout and the data-centering job.
# Distributed code to transform PCA data back into the original space. This code would use the existing distributed matrix multiplication of Mahout and the data-centering job.
# Glue code that combines all jobs into a coherent implementation of PCA.
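
For the sequential glue step (item 3), a rough sketch of how T could be constructed from the (small, d x d) covariance matrix is given below. Apache Commons Math is used here only as a stand-in eigensolver to keep the example self-contained; the actual implementation would call Mahout's existing eigensolvers, and the class name is hypothetical.

{code:java}
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

public class TransformationMatrixSketch {

  /** Returns the d x p matrix T whose columns are the p largest eigenvectors of c. */
  static RealMatrix computeT(double[][] c, int p) {
    EigenDecomposition eig = new EigenDecomposition(new Array2DRowRealMatrix(c));
    double[] lambda = eig.getRealEigenvalues();

    RealMatrix t = new Array2DRowRealMatrix(c.length, p);
    boolean[] used = new boolean[lambda.length];
    for (int k = 0; k < p; k++) {
      // Select the index of the largest eigenvalue that has not been used yet.
      int best = -1;
      for (int i = 0; i < lambda.length; i++) {
        if (!used[i] && (best < 0 || lambda[i] > lambda[best])) {
          best = i;
        }
      }
      used[best] = true;
      // The k-th column of T is the eigenvector of the k-th largest eigenvalue.
      t.setColumnVector(k, eig.getEigenvector(best));
    }
    return t;
  }
}
{code}

The back-transformation matrix T' is then simply the transpose of T.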

h2.Implementation details:
For a complete implementation the following three map/reduce jobs have to be 
written (rough sketches of the first and the third job are given after the list):
# An M/R job to compute the empirical mean vector. Each mapper reads a row (data point) and emits (dimension, [value, 1]) tuples. A local combiner pre-aggregates the values for each dimension to (dimension, [sum of values, number of data points]). Finally, a (single) reducer computes the average for each dimension and stores the mean vector back to disk.
# An M/R job to center (or uncenter) the data. The job consists of a single map phase that reads a row (data point) of the original matrix, subtracts (or adds) the empirical mean vector and emits (row number, new row) pairs that are written back to disk.
# An M/R job to compute the matrix product of a large matrix with its transpose. Basic idea: each mapper reads a row of the matrix and computes the products of all combinations (e.g. [a,b,c] --> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)). The key corresponds to the position within the symmetric result matrix. A local combiner sums up the local products for each key, the reducer sums up the per-mapper sums of products for each key and stores (position, value) tuples to disk. Finally, a (local) clean-up phase constructs the covariance matrix from the (position, value) tuples. Since the covariance matrix is d x d and d is typically small compared to the number of data points, I assume this final local step should be fine.
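
To illustrate the first job, a rough Hadoop sketch of the mean-vector computation is given below. The text input format (one comma-separated data point per line), the string-encoded (sum, count) pairs and all class names are assumptions made only for this example; a real implementation would presumably use Mahout's vector writables instead.

{code:java}
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanVectorJob {

  /** Emits (dimension, "value 1") for every component of every data point. */
  public static class MeanMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text row, Context ctx)
        throws IOException, InterruptedException {
      String[] components = row.toString().split(",");
      for (int j = 0; j < components.length; j++) {
        ctx.write(new IntWritable(j), new Text(components[j] + " 1"));
      }
    }
  }

  /** Local combiner: pre-aggregates to (dimension, "sum count"). */
  public static class SumCountCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable dim, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(" ");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      ctx.write(dim, new Text(sum + " " + count));
    }
  }

  /** Single reducer: divides the sum by the count to get one mean component per dimension. */
  public static class MeanReducer extends Reducer<IntWritable, Text, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable dim, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(" ");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      ctx.write(dim, new DoubleWritable(sum / count));
    }
  }
}
{code}

For the third job, the mapper that produces the per-row products could be sketched as follows (a fragment reusing the imports above); the combiner and reducer then simply sum the values per key:

{code:java}
  /**
   * Emits the products of all component pairs (j <= k) of one row; the key is
   * the position in the upper triangle of the symmetric d x d result matrix,
   * e.g. [a,b,c] --> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc).
   */
  public static class OuterProductMapper
      extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text row, Context ctx)
        throws IOException, InterruptedException {
      String[] components = row.toString().split(",");
      int pos = 0;
      for (int j = 0; j < components.length; j++) {
        for (int k = j; k < components.length; k++) {
          double product = Double.parseDouble(components[j]) * Double.parseDouble(components[k]);
          ctx.write(new IntWritable(pos++), new DoubleWritable(product));
        }
      }
    }
  }
{code}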

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.