[
https://issues.apache.org/jira/browse/MADLIB-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138021#comment-15138021
]
Frank McQuillan commented on MADLIB-948:
----------------------------------------
Please check the eigenvalue computation by comparing against scikit-learn or R
on a small example data set:
* It seems that the eigenvalues in MADlib are actually the square roots of the
true eigenvalues, so they are labeled incorrectly in the output table.
* A difference due to dividing by N versus N-1 is OK (Bessel's correction).
Please change the function to compute and output the actual eigenvalues.
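
As a rough reference for that comparison, here is a minimal pure-Python sketch that computes covariance eigenvalues for a tiny two-column data set, with both the N-1 and N divisors, along with their square roots. The data values and the helper names (covariance_2d, eigenvalues_sym_2x2) are made up for illustration and are not part of MADlib, scikit-learn, or R; the closed-form eigenvalue formula only applies to the 2x2 symmetric case.

```python
import math


def covariance_2d(rows, ddof=1):
    """Return (sxx, sxy, syy) for two-column data; ddof=1 applies Bessel's correction."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_y = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mean_x) ** 2 for r in rows) / (n - ddof)
    syy = sum((r[1] - mean_y) ** 2 for r in rows) / (n - ddof)
    sxy = sum((r[0] - mean_x) * (r[1] - mean_y) for r in rows) / (n - ddof)
    return sxx, sxy, syy


def eigenvalues_sym_2x2(sxx, sxy, syy):
    """Eigenvalues of [[sxx, sxy], [sxy, syy]] in closed form, largest first."""
    center = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    return [center + radius, center - radius]


# Hypothetical example data, chosen only to keep the arithmetic small.
data = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0)]

eig_sample = eigenvalues_sym_2x2(*covariance_2d(data, ddof=1))      # divisor N-1
eig_population = eigenvalues_sym_2x2(*covariance_2d(data, ddof=0))  # divisor N
sqrt_sample = [math.sqrt(max(v, 0.0)) for v in eig_sample]

print("eigenvalues (N-1):", eig_sample)
print("eigenvalues (N):  ", eig_population)
print("square roots:     ", sqrt_sample)
```

If a tool's reported values match the "square roots" line rather than either eigenvalue line, that would be consistent with the labeling problem described above. The two eigenvalue lines differ only by the constant factor N/(N-1).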
> Proportion of variance for PCA training function
> ------------------------------------------------
>
> Key: MADLIB-948
> URL: https://issues.apache.org/jira/browse/MADLIB-948
> Project: Apache MADlib
> Issue Type: New Feature
> Reporter: Frank McQuillan
> Priority: Minor
> Fix For: v2.0
>
>
> In future iterations of the pca_train function, is it feasible to add
> another optional parameter called variance_proportion? Instead of specifying k
> principal components to compute, you instead specify the proportion of
> variance that you want your PCA vectors to account for. The number of
> principal vectors generated would depend on the covariance matrix/correlation
> matrix (depending on whether you normalized or not) and variance_proportion.
> So if I specified that variance_proportion = .8, the algorithm would
> terminate after obtaining enough principal vectors so that the ratio of the
> sum of the eigenvalues collected thus far to the trace of the covariance
> matrix/correlation matrix (the sum of all of the eigenvalues of the
> covariance matrix/correlation matrix) is greater than or equal to .8. That
> is, the algorithm would terminate after collecting enough vectors to account
> for 80% of the total variance in the set of observations.
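
The stopping rule in the quoted request can be sketched in a few lines of Python. This is only an illustration of the selection logic, assuming all eigenvalues of the covariance/correlation matrix are already available. The function name components_for_variance is hypothetical and is not an existing MADlib API, and an actual implementation might compute components incrementally instead.

```python
def components_for_variance(eigenvalues, variance_proportion):
    """Smallest k such that the top-k eigenvalues reach the requested share of the trace."""
    if not 0.0 < variance_proportion <= 1.0:
        raise ValueError("variance_proportion must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)  # equals the trace of the covariance/correlation matrix
    if total <= 0.0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= variance_proportion:
            return k
    # Guard against floating-point rounding when variance_proportion is 1.0.
    return len(ordered)


# Hypothetical eigenvalues: cumulative shares are 0.4, 0.7, 0.9, 1.0.
k = components_for_variance([4.0, 3.0, 2.0, 1.0], 0.8)
print(k)  # 3
```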