I agree with Kasper: this is a 'big' issue. Does your method of taking only n PCs reduce the load on memory?

The new addition to the summary looks like a good idea, but 'Proportion of Variance' as you describe it may be confusing to new users. Am I correct in saying that Proportion of Variance is computed relative to only the components the user chooses to show? So if I choose only one PC, it will appear to explain 100% of the variance? If that is the case, I think showing a 'Total Proportion of Variance' is important.

Regards,

Steve Bronder
Website: stevebronder.com
Phone: 412-719-1282
Email: sbron...@stevebronder.com

On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:

> Martin, I fully agree. This becomes an issue when you have big matrices.
>
> (Note that there are awesome methods for actually only computing a small
> number of PCs (unlike your code, which uses svd, which gets all of them);
> these are available in various CRAN packages.)
>
> Best,
> Kasper
>
> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler <maech...@stat.math.ethz.ch> wrote:
>
> > Following from the R-help thread of March 22 on "Memory usage in prcomp",
> >
> > I've started looking into adding an optional 'rank.' argument
> > to prcomp allowing to more efficiently get only a few PCs
> > instead of the full p PCs, say when p = 1000 and you know you
> > only want 5 PCs.
> >
> > (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
> >
> > As it was mentioned, we already have an optional 'tol' argument
> > which allows *not* to choose all PCs.
> >
> > When I do that,
> > say
> >
> >    C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
> >    all.equal(S, crossprod(C))
> >    set.seed(17)
> >    X <- matrix(rnorm(32000), 1000, 32)
> >    Z <- X %*% C  ## ==>  cov(Z) ~=  C'C = S
> >    all.equal(cov(Z), S, tol = 0.08)
> >    pZ <- prcomp(Z, tol = 0.1)
> >    summary(pZ) # only ~14 PCs (out of 32)
> >
> > I get for the last line, the summary.prcomp(.) call :
> >
> > > summary(pZ) # only ~14 PCs (out of 32)
> > Importance of components:
> >                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
> > Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
> > Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
> > Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
> >                            PC9    PC10    PC11    PC12    PC13   PC14
> > Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
> > Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
> > Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
> >
> > which computes the *proportions* as if there were only 14 PCs in
> > total (but there were 32 originally).
> >
> > I would think that the summary should or could in addition show
> > the usual "proportion of variance explained" like result which
> > does involve all 32 variances or std.dev.s ... which are
> > returned from the svd() anyway, even in the case when I use my
> > new 'rank.' argument which only returns a "few" PCs instead of
> > all.
> >
> > Would you think the current summary() output is good enough or
> > rather misleading?
> >
> > I think I would want to see (possibly in addition) proportions
> > with respect to the full variance and not just to the variance
> > of those few components selected.
> >
> > Opinions?
> >
> > Martin Maechler
> > ETH Zurich
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
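
To make the two normalisations discussed in this thread concrete, here is a minimal sketch that reuses the example data from Martin's message. It is only an illustration, not the proposed change to prcomp: the names p_all, k, prop_retained and prop_total are made up here, and the proportions are computed by hand from a full PCA rather than read off summary(), whose output may differ between R versions.

```r
# Rebuild the example data from the quoted message.
C <- chol(S <- toeplitz(.9 ^ (0:31)))   # covariance matrix and its root
set.seed(17)
X <- matrix(rnorm(32000), 1000, 32)
Z <- X %*% C

# Full PCA, so that all 32 standard deviations are available.
p_all <- prcomp(Z)
vars  <- p_all$sdev^2

# Number of components kept by the documented 'tol' rule with tol = 0.1:
# keep those whose sdev exceeds tol times the sdev of the first component.
# (The quoted output suggests this is about 14 of 32.)
k <- sum(p_all$sdev > 0.1 * p_all$sdev[1])

# Normalisation 1: relative to the retained components only
# (the behaviour described in the quoted summary; sums to 1).
prop_retained <- vars[1:k] / sum(vars[1:k])

# Normalisation 2: relative to the variance of all 32 components
# (the "total" proportion of variance asked about above).
prop_total <- vars[1:k] / sum(vars)

# Compare the first few components side by side.
show <- seq_len(min(5, k))
round(rbind(
  "Proportion (retained only)" = prop_retained[show],
  "Proportion (all 32)"        = prop_total[show],
  "Cumulative (all 32)"        = cumsum(prop_total)[show]
), 4)

# The single-component case: under normalisation 1 the proportion is
# 1 by construction, whatever the data look like.
vars[1] / sum(vars[1])   # retained-only normalisation
vars[1] / sum(vars)      # normalisation against all components
```

Under the retained-only normalisation a single kept component always shows a proportion of 1, which is the confusing case raised above; normalising against all components avoids that, at the cost of needing every variance to be available.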