> On 25 Mar 2016, at 11:45 am, peter dalgaard <pda...@gmail.com> wrote:
>
>> On 25 Mar 2016, at 10:08 , Jari Oksanen <jari.oksa...@oulu.fi> wrote:
>>
>>> On 25 Mar 2016, at 10:41 am, peter dalgaard <pda...@gmail.com> wrote:
>>>
>>> As I see it, the display showing the first p << n PCs adding up to 100% of
>>> the variance is plainly wrong.
>>>
>>> I suspect it comes about via a mental short-circuit: if we try to control p
>>> using a tolerance, that amounts to saying that the remaining PCs are
>>> effectively zero-variance, but that is (usually) not the intention at all.
>>>
>>> The common case is that the remainder terms have a roughly _constant_,
>>> small-ish variance and are interpreted as noise. Of course, the magnitude
>>> of the noise is important information.
>>>
>> But then you should use Factor Analysis, which has that concept of “noise”
>> (unlike PCA).
>
> Actually, FA has a slightly different concept of noise. PCA can be
> interpreted as a purely technical operation, but also as an FA variant with
> the same variance for all components.
>
> Specifically, FA is
>
>    Sigma = L L' + Psi
>
> with Psi a diagonal matrix. If Psi = sigma^2 I, then L can be determined (up
> to rotation) as the first p components of PCA. (This is used in ML
> algorithms for FA, since it allows you to concentrate the likelihood to be a
> function of Psi.)

If I remember correctly, we took a correlation matrix, replaced the diagonal
elements with variable “communalities” < 1 estimated by some trick, and then
fed that matrix into PCA and called the result FA. A more advanced way was to
do this iteratively: take some first axes of PCA/FA, calculate the diagonal
elements from them & re-feed them into PCA. It was done like that because the
algorithms & computers were not strong enough for real FA. Now they are, and I
think it would be better to treat PCA like PCA, at least in the default output
of the standard stats::summary function.
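Peter’s remark that, when Psi = sigma^2 I, the loadings L coincide (up to rotation) with the first components of PCA is easy to check numerically. A small sketch with made-up dimensions (p, k, sigma and the random L are illustrative choices, not anything from the thread):

```r
## Sketch: with Psi = sigma^2 * I, the top-k eigenvectors of
## Sigma = L L' + sigma^2 * I span the same space as the loadings L,
## and the trailing eigenvalues are the constant "noise" variance.
set.seed(1)
p <- 6; k <- 2; sigma <- 0.3            # arbitrary illustrative values
L <- matrix(rnorm(p * k), p, k)         # hypothetical loadings
Sigma <- tcrossprod(L) + sigma^2 * diag(p)
ev <- eigen(Sigma, symmetric = TRUE)
V <- ev$vectors[, 1:k]                  # first k "principal components"
## projections onto col(V) and col(L) agree, i.e. L is recovered up to rotation:
all.equal(tcrossprod(V), L %*% solve(crossprod(L), t(L)))   # TRUE
## the remaining eigenvalues are all sigma^2 -- the constant-variance "noise":
all.equal(ev$values[(k + 1):p], rep(sigma^2, p - k))        # TRUE
```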
So summary() should show the proportion of the total variance (for people who think this is a cool thing to know) instead of showing a proportion of an unspecified part of the variance.
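The difference can be shown in a few lines: normalising the retained variances by their own sum always gives 100%, whereas normalising by the total variance gives the honest cumulative proportion. A sketch with arbitrary random data (not the toeplitz example from elsewhere in the thread):

```r
## Sketch: proportion of variance w.r.t. the *total* variance vs. w.r.t.
## only the components shown.  Z is arbitrary illustrative data.
set.seed(17)
Z <- matrix(rnorm(200 * 10), 200, 10)
## all 10 standard deviations, as prcomp() computes them via the svd:
sdev <- svd(scale(Z, center = TRUE, scale = FALSE), nu = 0, nv = 0)$d /
  sqrt(nrow(Z) - 1)
keep <- 1:3                                     # pretend we keep only 3 PCs
prop_total <- sdev[keep]^2 / sum(sdev^2)        # w.r.t. total variance
prop_shown <- sdev[keep]^2 / sum(sdev[keep]^2)  # w.r.t. shown PCs only
sum(prop_shown)  # exactly 1: the misleading "100%"
sum(prop_total)  # < 1: the informative cumulative proportion
```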
Cheers, Jari Oksanen (who now switches to listening to today’s Passion instead
of continuing with PCA)

> Methods like PC regression are not being very specific about the model, but
> the underlying line of thought is that PCs with small variances are
> "uninformative", so that you can make do with only the first handful of
> regressors. I tend to interpret "uninformative" as "noise-like" in these
> contexts.
>
> -pd
>
>> Cheers, Jari Oksanen
>>
>>>> On 25 Mar 2016, at 00:02 , Steve Bronder <sbron...@stevebronder.com> wrote:
>>>>
>>>> I agree with Kasper: this is a 'big' issue. Does your method of taking
>>>> only n PCs reduce the load on memory?
>>>>
>>>> The new addition to the summary looks like a good idea, but Proportion of
>>>> Variance as you describe it may be confusing to new users. Am I correct
>>>> in saying that Proportion of Variance describes the amount of variance
>>>> with respect to the number of components the user chooses to show? So if
>>>> I choose only one, I will explain 100% of the variance? If that is the
>>>> case, I think showing a 'Total Proportion of Variance' is important.
>>>>
>>>> Regards,
>>>>
>>>> Steve Bronder
>>>> Website: stevebronder.com
>>>> Phone: 412-719-1282
>>>> Email: sbron...@stevebronder.com
>>>>
>>>> On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen
>>>> <kasperdanielhan...@gmail.com> wrote:
>>>>
>>>>> Martin, I fully agree. This becomes an issue when you have big matrices.
>>>>>
>>>>> (Note that there are awesome methods for actually only computing a small
>>>>> number of PCs, unlike your code, which uses svd and gets all of them;
>>>>> these are available in various CRAN packages.)
>>>>>
>>>>> Best,
>>>>> Kasper
>>>>>
>>>>> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler
>>>>> <maech...@stat.math.ethz.ch> wrote:
>>>>>
>>>>>> Following on from the R-help thread of March 22 on "Memory usage in
>>>>>> prcomp", I've started looking into adding an optional 'rank.'
>>>>>> argument to prcomp, allowing one to get only a few PCs more
>>>>>> efficiently instead of the full p PCs, say when p = 1000 and you know
>>>>>> you only want 5 PCs.
>>>>>> (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
>>>>>>
>>>>>> As was mentioned, we already have an optional 'tol' argument which
>>>>>> allows one *not* to choose all PCs.
>>>>>>
>>>>>> When I do that, say
>>>>>>
>>>>>> C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov. matrix and its root
>>>>>> all.equal(S, crossprod(C))
>>>>>> set.seed(17)
>>>>>> X <- matrix(rnorm(32000), 1000, 32)
>>>>>> Z <- X %*% C ## ==> cov(Z) ~= C'C = S
>>>>>> all.equal(cov(Z), S, tol = 0.08)
>>>>>> pZ <- prcomp(Z, tol = 0.1)
>>>>>> summary(pZ) # only ~14 PCs (out of 32)
>>>>>>
>>>>>> I get for the last line, the summary.prcomp(.) call:
>>>>>>
>>>>>>> summary(pZ) # only ~14 PCs (out of 32)
>>>>>> Importance of components:
>>>>>>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
>>>>>> Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
>>>>>> Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
>>>>>> Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
>>>>>>                            PC9    PC10    PC11    PC12    PC13   PC14
>>>>>> Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
>>>>>> Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
>>>>>> Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>>>>>>
>>>>>> which computes the *proportions* as if there were only 14 PCs in
>>>>>> total (but there were 32 originally).
>>>>>>
>>>>>> I would think that the summary should, or could in addition, show the
>>>>>> usual "proportion of variance explained"-like result, which does
>>>>>> involve all 32 variances or std.dev.s ... which are returned from the
>>>>>> svd() anyway, even in the case when I use my new 'rank.'
>>>>>> argument, which only returns a "few" PCs instead of all of them.
>>>>>>
>>>>>> Would you think the current summary() output is good enough, or
>>>>>> rather misleading?
>>>>>>
>>>>>> I think I would want to see (possibly in addition) proportions with
>>>>>> respect to the full variance, and not just to the variance of those
>>>>>> few components selected.
>>>>>>
>>>>>> Opinions?
>>>>>>
>>>>>> Martin Maechler
>>>>>> ETH Zurich
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-devel@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>> --
>>> Peter Dalgaard, Professor,
>>> Center for Statistics, Copenhagen Business School
>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>> Phone: (+45)38153501
>>> Office: A 4.23
>>> Email: pd....@cbs.dk  Priv: pda...@gmail.com
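The 'rank.' idea Martin describes can be sketched directly with svd()'s nu/nv arguments: keep only k columns of the rotation and the scores, but retain the full sdev vector so that summary-style proportions can still be taken with respect to the total variance. The helper below (prcomp_rank is a made-up name, not the implementation that went into R) is only a sketch of that behaviour:

```r
## Hypothetical sketch of a rank-limited PCA, *not* the real prcomp code:
## svd(x, nu = k, nv = k) truncates U and V but still returns all singular
## values, which is exactly what a total-variance summary needs.
prcomp_rank <- function(x, k) {
  x <- scale(x, center = TRUE, scale = FALSE)
  s <- svd(x, nu = k, nv = k)
  list(sdev     = s$d / sqrt(nrow(x) - 1),    # full length: all PCs
       rotation = s$v,                        # p x k
       x        = s$u %*% diag(s$d[1:k], k))  # n x k scores
}
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32)
fit <- prcomp_rank(Z, k = 5)
dim(fit$x)                          # 1000 x 5: only the 5 requested PCs
length(fit$sdev)                    # 32: enough for proportions of total variance
all.equal(fit$sdev, prcomp(Z)$sdev) # TRUE: matches the full decomposition
```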