Deron Eriksson created SYSTEMML-1146:
----------------------------------------
Summary: Improve PCA description in documentation
Key: SYSTEMML-1146
URL: https://issues.apache.org/jira/browse/SYSTEMML-1146
Project: SystemML
Issue Type: Improvement
Components: Documentation
Reporter: Deron Eriksson
Priority: Minor
David P. Nichols reports that the first sentence of the PCA description in the
Algorithms Reference
(http://apache.github.io/incubator-systemml/algorithms-matrix-factorization.html#principal-component-analysis)
is inaccurate:
"Principal Component Analysis (PCA) is a simple, non-parametric method to
transform the given data set with possibly correlated columns into a set of
linearly uncorrelated or orthogonal columns, called principal components."
The problem with this statement is that principal component scores typically
will not be uncorrelated unless the input data have been centered (or began
with means of 0). Orthogonal and uncorrelated are not the same thing. Whether
two vectors are orthogonal depends on their raw values (their dot product is
zero), while covariance, and hence correlation, is computed from the centered
values.
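
To make the distinction concrete, here is a minimal sketch in plain Python. The
vectors and the helper names (dot, center, correlation) are made up for
illustration and are not taken from SystemML or its documentation.

```python
import math

def dot(u, v):
    # Raw dot product: orthogonality is judged on these uncentered values.
    return sum(a * b for a, b in zip(u, v))

def center(v):
    # Subtract the mean so that the result has mean 0.
    m = sum(v) / len(v)
    return [a - m for a in v]

def correlation(u, v):
    # Pearson correlation: the dot product of the centered vectors, normalized.
    uc, vc = center(u), center(v)
    return dot(uc, vc) / math.sqrt(dot(uc, uc) * dot(vc, vc))

# Orthogonal (raw dot product is 0) but strongly correlated.
u, v = [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]

# Uncorrelated (centered dot product is 0) but not orthogonal.
p, q = [1.0, 2.0, 3.0], [1.0, 0.0, 1.0]

print(dot(u, v), correlation(u, v))
print(dot(p, q), correlation(p, q))
```

The first pair has a raw dot product of 0 yet a correlation of -1, and the
second pair has a correlation of 0 yet a raw dot product of 4, so neither
property implies the other.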
It looks like the text was taken from Wikipedia's Principal component analysis
entry. Whoever wrote that part of that entry seems to be assuming that
principal component analysis always involves working on a matrix of centered
(or centered and scaled) data, but that is not always the case. The default in
SystemML is not to center input columns, so the resulting data columns
typically will not be uncorrelated, though they will be orthogonal.
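
The effect of centering can be sketched on a tiny two-column data set. This is
an illustration only, under the assumption that "uncentered PCA" means
projecting the data onto the eigenvectors of the raw cross-product matrix
X^T X; SystemML's PCA script may compute its components differently. The data
values and the helper scores_2d are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def correlation(u, v):
    uc, vc = center(u), center(v)
    return dot(uc, vc) / math.sqrt(dot(uc, uc) * dot(vc, vc))

def scores_2d(xs, ys):
    # Hypothetical helper: project two columns onto the eigenvectors of their
    # 2x2 cross-product matrix [[a, b], [b, c]]. For a symmetric 2x2 matrix
    # the eigenvectors are a rotation by theta, where tan(2*theta) = 2b/(a-c).
    a, b, c = dot(xs, xs), dot(xs, ys), dot(ys, ys)
    theta = 0.5 * math.atan2(2 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    s1 = [x * cs + y * sn for x, y in zip(xs, ys)]
    s2 = [-x * sn + y * cs for x, y in zip(xs, ys)]
    return s1, s2

# Made-up input columns with non-zero means.
xs = [2.0, 3.0, 5.0, 6.0]
ys = [1.0, 4.0, 2.0, 6.0]

raw1, raw2 = scores_2d(xs, ys)                   # no centering
cen1, cen2 = scores_2d(center(xs), center(ys))   # centered first

print("uncentered:", dot(raw1, raw2), correlation(raw1, raw2))
print("centered:  ", dot(cen1, cen2), correlation(cen1, cen2))
```

Without centering, the two score columns are orthogonal (dot product zero up to
rounding) but their correlation is clearly non-zero. With centering, the score
columns have mean 0, so orthogonal and uncorrelated coincide.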