[ 
https://issues.apache.org/jira/browse/MADLIB-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143866#comment-15143866
 ] 

Orhan Kislal commented on MADLIB-948:
-------------------------------------

Implementing this functionality has implications for how the 
lanczos_bidiagonalize function works. The number of iterations for this 
function may be supplied by the user; however, this value is bounded below 
by the number of desired principal components (k). Since k is not known at 
the time of execution (the implementation assumes that k is set to 
min(row_dim, col_dim)), it is not possible to use the lanczos_iter parameter 
in conjunction with the proportion of variance. It is technically possible to 
limit the output by both proportion and lanczos_iter, but this would incur 
significant overhead, since we would have to calculate the trace of the 
covariance matrix separately and would not be able to reuse the values we 
already have. Alternatively, we can apply the default value for lanczos_iter 
and inform the user that the proportion functionality does not support custom 
lanczos_iter values, since this is an exploratory function by definition.
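
To make the trade-off concrete, here is a minimal sketch in plain Python. It 
is not MADlib code: the function names covariance_trace and 
components_for_proportion are hypothetical, and the input is assumed to be a 
small in-memory list of rows. It relies on the fact that the trace of the 
sample covariance matrix equals the sum of the per-column sample variances, 
which is the separate pass over the data mentioned above when only a 
truncated set of eigenvalues is available.

```python
from itertools import accumulate


def covariance_trace(rows):
    """Trace of the sample covariance matrix of `rows`.

    Computed as the sum of per-column sample variances, so no
    eigendecomposition is needed. Hypothetical helper, not a MADlib function.
    """
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two rows")
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return sum(
        sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        for j in range(dims)
    )


def components_for_proportion(eigenvalues, total_variance, proportion):
    """Smallest k whose top-k eigenvalues reach `proportion` of the variance.

    Returns None when the supplied (possibly truncated) eigenvalues do not
    reach the target, i.e. more components would have to be computed.
    """
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    target = proportion * total_variance
    ordered = sorted(eigenvalues, reverse=True)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative >= target:
            return k
    return None
```

With all eigenvalues available (k = min(row_dim, col_dim)), total_variance 
is simply their sum and no extra pass is required. With a truncated set, 
total_variance has to come from something like covariance_trace, and a 
result of None signals that the truncation was too aggressive.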

> Proportion of variance for PCA training function
> ------------------------------------------------
>
>                 Key: MADLIB-948
>                 URL: https://issues.apache.org/jira/browse/MADLIB-948
>             Project: Apache MADlib
>          Issue Type: New Feature
>            Reporter: Frank McQuillan
>            Priority: Minor
>             Fix For: v2.0
>
>
> In future iterations of the pca_train command, is it feasible to insert 
> another optional command called variance_proportion? Instead of specifying k 
> principal components to compute, you instead specify the proportion of 
> variance that you want your PCA vectors to account for. The number of 
> principal vectors generated would depend on the covariance matrix/correlation 
> matrix (depending on whether you normalized or not) and variance_proportion. 
> So if I specified that variance_proportion = .8, the algorithm would 
> terminate after obtaining enough principal vectors so that the ratio of the 
> sum of the eigenvalues collected thus far to the trace of the covariance 
> matrix/correlation matrix (the sum of all of the eigenvalues of the 
> covariance matrix/correlation matrix) is greater than or equal to .8. That 
> is, the algorithm would terminate after collecting enough vectors to account 
> for 80% of the total variance in the set of observations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
