Deron Eriksson created SYSTEMML-1146:
----------------------------------------

             Summary: Improve PCA description in documentation
                 Key: SYSTEMML-1146
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1146
             Project: SystemML
          Issue Type: Improvement
          Components: Documentation
            Reporter: Deron Eriksson
            Priority: Minor


David P. Nichols reports that the first sentence of the PCA description in the 
Algorithms Reference is inaccurate 
(http://apache.github.io/incubator-systemml/algorithms-matrix-factorization.html#principal-component-analysis).

"Principal Component Analysis (PCA) is a simple, non-parametric method to 
transform the given data set with possibly correlated columns into a set of 
linearly uncorrelated or orthogonal columns, called principal components." 

The problem with this statement is that principal component scores typically 
will not be uncorrelated unless the input data have been centered (or already 
had column means of 0). Orthogonal and uncorrelated are not the same thing: 
whether two vectors are orthogonal depends on their raw values, while 
covariance, and hence correlation, depends on the centered values. 
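
The distinction can be checked numerically. The following is a minimal NumPy 
sketch, not SystemML's PCA script: it assumes that "uncentered PCA" means 
projecting the raw data onto the eigenvectors of X^T X (obtained here via an 
SVD), and it uses made-up synthetic data. The helper names pca_scores and 
max_offdiag are hypothetical, introduced only for this illustration.

```python
import numpy as np

def pca_scores(X, center):
    """Project X onto its principal axes, optionally centering columns first."""
    Xc = X - X.mean(axis=0) if center else X
    # The right singular vectors of Xc are the eigenvectors of Xc^T Xc.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T

def max_offdiag(M):
    """Largest absolute off-diagonal entry of a square matrix."""
    return np.abs(M - np.diag(np.diag(M))).max()

# Synthetic data (illustrative only): correlated columns with nonzero means.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(200, 3)) @ mixing + np.array([5.0, -3.0, 2.0])

raw = pca_scores(X, center=False)
cen = pca_scores(X, center=True)

# Orthogonality is judged on raw dot products between score columns.
raw_dot = max_offdiag(raw.T @ raw)
cen_dot = max_offdiag(cen.T @ cen)

# Correlation is judged after subtracting each column's mean.
raw_corr = max_offdiag(np.corrcoef(raw, rowvar=False))
cen_corr = max_offdiag(np.corrcoef(cen, rowvar=False))

print("uncentered: max |dot| =", raw_dot, " max |corr| =", raw_corr)
print("centered:   max |dot| =", cen_dot, " max |corr| =", cen_corr)
```

Under these assumptions, the uncentered scores should show dot products near 
zero but clearly nonzero correlations, while the centered scores should be 
both orthogonal and uncorrelated up to floating-point error.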

It looks like the text was taken from Wikipedia's Principal component analysis 
entry. Whoever wrote that part of the entry seems to be assuming that 
principal component analysis always operates on a matrix of centered (or 
centered and scaled) data, but that is not always the case. The default in 
SystemML is not to center input columns, so the resulting data columns 
typically will not be uncorrelated, though they will be orthogonal.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
