> On 25 Mar 2016, at 11:45 am, peter dalgaard <pda...@gmail.com> wrote:
>
>> On 25 Mar 2016, at 10:08 , Jari Oksanen <jari.oksa...@oulu.fi> wrote:
>>
>>> On 25 Mar 2016, at 10:41 am, peter dalgaard <pda...@gmail.com> wrote:
>>>
>>> As I see it, the display showing the first p << n PCs adding up to 100% of
>>> the variance is plainly wrong.
>>>
>>> I suspect it comes about via a mental short-circuit: if we try to control p
>>> using a tolerance, that amounts to saying that the remaining PCs are
>>> effectively zero-variance, but that is (usually) not the intention at all.
>>>
>>> The common case is that the remainder terms have a roughly _constant_,
>>> small-ish variance and are interpreted as noise. Of course, the magnitude
>>> of the noise is important information.
>>>
>> But then you should use Factor Analysis, which has that concept of “noise”
>> (unlike PCA).
>
> Actually, FA has a slightly different concept of noise. PCA can be
> interpreted as a purely technical operation, but also as an FA variant with
> the same variance for all components.
>
> Specifically, FA is
>
>    Sigma = L L' + Psi
>
> with Psi a diagonal matrix. If Psi = sigma^2 I, then L can be determined (up
> to rotation) as the first p components of PCA. (This is used in ML
> algorithms for FA, since it allows you to concentrate the likelihood to be a
> function of Psi.)

If I remember correctly, we took a correlation matrix, replaced the diagonal
elements with variable “communalities” < 1 estimated by some trick, and then
fed that matrix into PCA and called the result FA. A more advanced way was to
do this iteratively: take some first axes of PCA/FA, calculate the diagonal
elements from them & re-feed them into PCA. It was done like that because the
algorithms & computers were not strong enough for real FA. Now they are, and I
think it would be better to treat PCA like PCA, at least in the default output
of the standard stats::summary function.
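Peter’s remark that, when Psi = sigma^2 I, the loadings L coincide (up to rotation) with the first components of PCA is easy to check numerically. A small sketch with made-up dimensions (p, k, sigma and the random L are illustrative choices, not anything from the thread):

```r
## Sketch: with Psi = sigma^2 * I, the top-k eigenvectors of
## Sigma = L L' + sigma^2 * I span the same space as the loadings L,
## and the trailing eigenvalues are the constant "noise" variance.
set.seed(1)
p <- 6; k <- 2; sigma <- 0.3            # arbitrary illustrative values
L <- matrix(rnorm(p * k), p, k)         # hypothetical loadings
Sigma <- tcrossprod(L) + sigma^2 * diag(p)
ev <- eigen(Sigma, symmetric = TRUE)
V <- ev$vectors[, 1:k]                  # first k "principal components"
## projections onto col(V) and col(L) agree, i.e. L is recovered up to rotation:
all.equal(tcrossprod(V), L %*% solve(crossprod(L), t(L)))   # TRUE
## the remaining eigenvalues are all sigma^2 -- the constant-variance "noise":
all.equal(ev$values[(k + 1):p], rep(sigma^2, p - k))        # TRUE
```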
So summary() should show the proportion of the total variance (for people who think this is a cool thing to know) instead of showing a proportion of an unspecified part of the variance.
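The difference can be shown in a few lines: normalising the retained variances by their own sum always gives 100%, whereas normalising by the total variance gives the honest cumulative proportion. A sketch with arbitrary random data (not the toeplitz example from elsewhere in the thread):

```r
## Sketch: proportion of variance w.r.t. the *total* variance vs. w.r.t.
## only the components shown.  Z is arbitrary illustrative data.
set.seed(17)
Z <- matrix(rnorm(200 * 10), 200, 10)
## all 10 standard deviations, as prcomp() computes them via the svd:
sdev <- svd(scale(Z, center = TRUE, scale = FALSE), nu = 0, nv = 0)$d /
  sqrt(nrow(Z) - 1)
keep <- 1:3                                     # pretend we keep only 3 PCs
prop_total <- sdev[keep]^2 / sum(sdev^2)        # w.r.t. total variance
prop_shown <- sdev[keep]^2 / sum(sdev[keep]^2)  # w.r.t. shown PCs only
sum(prop_shown)  # exactly 1: the misleading "100%"
sum(prop_total)  # < 1: the informative cumulative proportion
```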
Cheers, Jari Oksanen (who now switches to listening to today’s Passion instead
of continuing with PCA)

> Methods like PC regression are not being very specific about the model, but
> the underlying line of thought is that PCs with small variances are
> "uninformative", so that you can make do with only the first handful of
> regressors. I tend to interpret "uninformative" as "noise-like" in these
> contexts.
>
> -pd
>
>> Cheers, Jari Oksanen
>>
>>>> On 25 Mar 2016, at 00:02 , Steve Bronder <sbron...@stevebronder.com> wrote:
>>>>
>>>> I agree with Kasper: this is a 'big' issue. Does your method of taking
>>>> only n PCs reduce the load on memory?
>>>>
>>>> The new addition to the summary looks like a good idea, but Proportion of
>>>> Variance as you describe it may be confusing to new users. Am I correct
>>>> in saying that Proportion of Variance describes the amount of variance
>>>> with respect to the number of components the user chooses to show? So if
>>>> I choose only one, I will explain 100% of the variance? If that is the
>>>> case, I think showing a 'Total Proportion of Variance' is important.
>>>>
>>>> Regards,
>>>>
>>>> Steve Bronder
>>>> Website: stevebronder.com
>>>> Phone: 412-719-1282
>>>> Email: sbron...@stevebronder.com
>>>>
>>>> On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen
>>>> <kasperdanielhan...@gmail.com> wrote:
>>>>
>>>>> Martin, I fully agree. This becomes an issue when you have big matrices.
>>>>>
>>>>> (Note that there are awesome methods for actually only computing a small
>>>>> number of PCs, unlike your code, which uses svd and gets all of them;
>>>>> these are available in various CRAN packages.)
>>>>>
>>>>> Best,
>>>>> Kasper
>>>>>
>>>>> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler
>>>>> <maech...@stat.math.ethz.ch> wrote:
>>>>>
>>>>>> Following on from the R-help thread of March 22 on "Memory usage in
>>>>>> prcomp", I've started looking into adding an optional 'rank.'
>>>>>> argument to prcomp, allowing one to get only a few PCs more
>>>>>> efficiently instead of the full p PCs, say when p = 1000 and you know
>>>>>> you only want 5 PCs.
>>>>>> (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
>>>>>>
>>>>>> As was mentioned, we already have an optional 'tol' argument which
>>>>>> allows one *not* to choose all PCs.
>>>>>>
>>>>>> When I do that, say
>>>>>>
>>>>>> C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov. matrix and its root
>>>>>> all.equal(S, crossprod(C))
>>>>>> set.seed(17)
>>>>>> X <- matrix(rnorm(32000), 1000, 32)
>>>>>> Z <- X %*% C ## ==> cov(Z) ~= C'C = S
>>>>>> all.equal(cov(Z), S, tol = 0.08)
>>>>>> pZ <- prcomp(Z, tol = 0.1)
>>>>>> summary(pZ) # only ~14 PCs (out of 32)
>>>>>>
>>>>>> I get for the last line, the summary.prcomp(.) call:
>>>>>>
>>>>>>> summary(pZ) # only ~14 PCs (out of 32)
>>>>>> Importance of components:
>>>>>>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
>>>>>> Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
>>>>>> Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
>>>>>> Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
>>>>>>                            PC9    PC10    PC11    PC12    PC13   PC14
>>>>>> Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
>>>>>> Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
>>>>>> Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>>>>>>
>>>>>> which computes the *proportions* as if there were only 14 PCs in
>>>>>> total (but there were 32 originally).
>>>>>>
>>>>>> I would think that the summary should, or could in addition, show the
>>>>>> usual "proportion of variance explained"-like result, which does
>>>>>> involve all 32 variances or std.dev.s ... which are returned from the
>>>>>> svd() anyway, even in the case when I use my new 'rank.'
>>>>>> argument, which only returns a "few" PCs instead of all of them.
>>>>>>
>>>>>> Would you think the current summary() output is good enough, or
>>>>>> rather misleading?
>>>>>>
>>>>>> I think I would want to see (possibly in addition) proportions with
>>>>>> respect to the full variance, and not just to the variance of those
>>>>>> few components selected.
>>>>>>
>>>>>> Opinions?
>>>>>>
>>>>>> Martin Maechler
>>>>>> ETH Zurich
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-devel@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>> --
>>> Peter Dalgaard, Professor,
>>> Center for Statistics, Copenhagen Business School
>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>> Phone: (+45)38153501
>>> Office: A 4.23
>>> Email: pd....@cbs.dk  Priv: pda...@gmail.com
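The 'rank.' idea Martin describes can be sketched directly with svd()'s nu/nv arguments: keep only k columns of the rotation and the scores, but retain the full sdev vector so that summary-style proportions can still be taken with respect to the total variance. The helper below (prcomp_rank is a made-up name, not the implementation that went into R) is only a sketch of that behaviour:

```r
## Hypothetical sketch of a rank-limited PCA, *not* the real prcomp code:
## svd(x, nu = k, nv = k) truncates U and V but still returns all singular
## values, which is exactly what a total-variance summary needs.
prcomp_rank <- function(x, k) {
  x <- scale(x, center = TRUE, scale = FALSE)
  s <- svd(x, nu = k, nv = k)
  list(sdev     = s$d / sqrt(nrow(x) - 1),    # full length: all PCs
       rotation = s$v,                        # p x k
       x        = s$u %*% diag(s$d[1:k], k))  # n x k scores
}
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32)
fit <- prcomp_rank(Z, k = 5)
dim(fit$x)                          # 1000 x 5: only the 5 requested PCs
length(fit$sdev)                    # 32: enough for proportions of total variance
all.equal(fit$sdev, prcomp(Z)$sdev) # TRUE: matches the full decomposition
```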