Thank you all for your feedback.
The problem I originally raised wasn't the definition of PCA but what
the sklearn method actually does. In practice I would always make sure the
data is both centered and scaled before performing PCA. This is the
recommended approach because, without scaling, a feature measured on a
much larger scale can dominate the first component, which then wrongly
seems to explain a huge fraction of the variance.
So my point was simply that the help file and the user guide should state
precisely what the PCA class does, leaving no ambiguity for the reader.
I have now submitted a pull request on GitHub, as
initially suggested by Roman in this thread.
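To illustrate the scaling point, here is a minimal sketch. The synthetic data (two independent features, one on a 1000x larger scale) and the variable names are assumptions made up for the example; only `PCA`, `StandardScaler` and `make_pipeline` come from scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two independent features; the second is measured on a 1000x larger scale.
X = np.column_stack([rng.normal(size=200), 1000.0 * rng.normal(size=200)])

# PCA alone: the data is centered, but not scaled.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# Standardize each feature first, then run PCA.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
scaled_ratio = pipe.named_steps["pca"].explained_variance_ratio_

print(raw_ratio)     # the first component takes almost all of the variance
print(scaled_ratio)  # roughly balanced between the two components
```

Whether to standardize is a modelling choice; the sketch only shows how strongly the answer can depend on it.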
Best,
Ismael
On Mon, 16 Oct 2017 at 11:49 AM, <scikit-learn-requ...@python.org
<mailto:scikit-learn-requ...@python.org>> wrote:
Send scikit-learn mailing list submissions to
scikit-learn@python.org <mailto:scikit-learn@python.org>
To subscribe or unsubscribe via the World Wide Web, visit
https://mail.python.org/mailman/listinfo/scikit-learn
or, via email, send a message with subject or body 'help' to
scikit-learn-requ...@python.org
<mailto:scikit-learn-requ...@python.org>
You can reach the person managing the list at
scikit-learn-ow...@python.org <mailto:scikit-learn-ow...@python.org>
When replying, please edit your Subject line so it is more specific
than "Re: Contents of scikit-learn digest..."
Today's Topics:
1. Re: 1. Re: unclear help file for sklearn.decomposition.pca
(Andreas Mueller)
2. Re: 1. Re: unclear help file for sklearn.decomposition.pca
(Oliver Tomic)
----------------------------------------------------------------------
Message: 1
Date: Mon, 16 Oct 2017 14:44:51 -0400
From: Andreas Mueller <t3k...@gmail.com <mailto:t3k...@gmail.com>>
To: scikit-learn@python.org <mailto:scikit-learn@python.org>
Subject: Re: [scikit-learn] 1. Re: unclear help file for
sklearn.decomposition.pca
Message-ID: <35142868-fce9-6cb3-eba3-015a0b106...@gmail.com
<mailto:35142868-fce9-6cb3-eba3-015a0b106...@gmail.com>>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
On 10/16/2017 02:27 PM, Ismael Lemhadri wrote:
> @Andreas Muller:
> My references do not assume centering, e.g.
> http://ufldl.stanford.edu/wiki/index.php/PCA
> any reference?
>
It kinda does but is not very clear about it:
This data has already been pre-processed so that each of the
features x_1 and x_2 have about the same mean (zero)
and variance.
Wikipedia is much clearer:
Consider a data matrix
<https://en.wikipedia.org/wiki/Matrix_%28mathematics%29>, X, with
column-wise zero empirical mean
<https://en.wikipedia.org/wiki/Empirical_mean> (the sample mean of each
column has been shifted to zero), where each of the n rows represents a
different repetition of the experiment, and each of the p columns gives
a particular kind of feature (say, the results from a particular
sensor).
https://en.wikipedia.org/wiki/Principal_component_analysis#Details
I'm a bit surprised to find that ESL says "The SVD of the centered
matrix X is another way of expressing the principal components of the
variables in X",
so they assume scaling? They don't really have a great treatment of PCA,
though.
Bishop <http://www.springer.com/us/book/9780387310732> and Murphy
<https://mitpress.mit.edu/books/machine-learning-0> are pretty clear
that they subtract the mean (or assume zero mean) but don't
standardize.
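For reference, the "subtract the mean but don't standardize" definition can be written out as below. The notation is an assumption chosen for this note (X is the n-by-p data matrix, \bar{x} the vector of column means, \mathbf{1} the all-ones vector); it is not taken from any of the cited books.

```latex
X_c = X - \mathbf{1}\bar{x}^{\top}, \qquad
S = \frac{1}{n-1}\, X_c^{\top} X_c, \qquad
X_c = U \Sigma V^{\top}
\;\Longrightarrow\;
S = V \,\frac{\Sigma^{2}}{n-1}\, V^{\top}.
```

The principal axes are the columns of V and the scores are X_c V = U \Sigma. Standardizing would additionally divide each column of X_c by its standard deviation, which turns S into the correlation matrix.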
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://mail.python.org/pipermail/scikit-learn/attachments/20171016/81b3014b/attachment-0001.html>
------------------------------
Message: 2
Date: Mon, 16 Oct 2017 20:48:29 +0200
From: Oliver Tomic <oliverto...@zoho.com>
To: "Scikit-learn mailing list" <scikit-learn@python.org>
Cc: <scikit-learn@python.org>
Subject: Re: [scikit-learn] 1. Re: unclear help file for
sklearn.decomposition.pca
Message-ID: <15f26840d65.e97b33c25239.3934951873824890...@zoho.com>
Content-Type: text/plain; charset="utf-8"
Dear Ismael,
PCA should always involve at least centering and, if the
variables are to contribute equally, scaling as well. Here is a reference
from the scientific field of chemometrics. In chemometrics, PCA is
used not only for dimensionality reduction but also for
interpreting variance by means of scores, loadings, correlation
loadings, etc.
If you scroll down to the subsection "Preprocessing", you will find
more information on centering and scaling.
http://pubs.rsc.org/en/content/articlehtml/2014/ay/c3ay41907j
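One way to see the difference between centering only and centering plus scaling is that the former corresponds to the covariance matrix and the latter to the correlation matrix. The sketch below checks this numerically on made-up data; the data and variable names are assumptions for illustration, and only variance ratios are compared so the check does not depend on the normalization convention (n versus n - 1).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
# Hypothetical data: three correlated features on very different scales.
latent = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
X = latent * np.array([1.0, 10.0, 100.0])

# Centering only: variance ratios follow the covariance matrix eigenvalues.
cov_eigs = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
pca_cov = PCA().fit(X)

# Centering and scaling: variance ratios follow the correlation matrix eigenvalues.
corr_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
pca_corr = PCA().fit(StandardScaler().fit_transform(X))

print(np.allclose(pca_cov.explained_variance_ratio_, cov_eigs / cov_eigs.sum()))
print(np.allclose(pca_corr.explained_variance_ratio_, corr_eigs / corr_eigs.sum()))
```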
best
Oliver
---- On Mon, 16 Oct 2017 20:27:11 +0200 Ismael Lemhadri
<lemha...@stanford.edu> wrote ----
@Andreas Muller:
My references do not assume centering, e.g.
http://ufldl.stanford.edu/wiki/index.php/PCA
any reference?
On Mon, Oct 16, 2017 at 10:20 AM,
<scikit-learn-requ...@python.org
<mailto:lt%3bscikit-learn-requ...@python.org>> wrote:
Today's Topics:
1. Re: unclear help file for sklearn.decomposition.pca
(Andreas Mueller)
----------------------------------------------------------------------
Message: 1
Date: Mon, 16 Oct 2017 13:19:57 -0400
From: Andreas Mueller <t3k...@gmail.com>
To: scikit-learn@python.org
Subject: Re: [scikit-learn] unclear help file for
sklearn.decomposition.pca
Message-ID: <04fc445c-d8f3-a3a9-4ab2-0535826a2...@gmail.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
The definition of PCA has a centering step, but no scaling step.
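A quick way to check this behavior on a fitted estimator is sketched below. The data is made up for the example; `mean_`, `components_` and `transform` are attributes and methods of `sklearn.decomposition.PCA`, and the check assumes the default `whiten=False`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data with non-zero means and unequal scales per column.
X = rng.rand(50, 3) * np.array([1.0, 5.0, 20.0]) + np.array([10.0, -3.0, 7.0])

pca = PCA(n_components=2, svd_solver="full").fit(X)

# The fitted estimator stores the column means it subtracted.
mean_matches = np.allclose(pca.mean_, X.mean(axis=0))

# transform() projects the centered (not standardized) data onto the components.
projected = (X - pca.mean_) @ pca.components_.T
transform_matches = np.allclose(pca.transform(X), projected)

print(mean_matches, transform_matches)
```

No division by the per-column standard deviation is needed to reproduce `transform`, which is the "centering but no scaling" behavior in question.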
On 10/16/2017 11:16 AM, Ismael Lemhadri wrote:
> Dear Roman,
> My concern is actually not that the scaling is not mentioned, but that
> the centering is not mentioned.
> That is, the sklearn PCA removes the mean but does not mention it
> in the help file.
> This was quite messy for me to debug, as I expected it to either
> 1/ center and scale simultaneously, or 2/ neither scale nor center.
> It would be beneficial to make the behavior explicit in the help file,
> in my opinion.
> Ismael
>
> On Mon, Oct 16, 2017 at 8:02 AM, <scikit-learn-requ...@python.org> wrote:
>
> Today's Topics:
>
>    1. unclear help file for sklearn.decomposition.pca (Ismael Lemhadri)
>    2. Re: unclear help file for sklearn.decomposition.pca (Roman Yurchak)
>    3. Question about LDA's coef_ attribute (Serafeim Loukas)
>    4. Re: Question about LDA's coef_ attribute (Alexandre Gramfort)
>    5. Re: Question about LDA's coef_ attribute (Serafeim Loukas)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 15 Oct 2017 18:42:56 -0700
> From: Ismael Lemhadri <lemha...@stanford.edu>
> To: scikit-learn@python.org
> Subject: [scikit-learn] unclear help file for sklearn.decomposition.pca
> Message-ID:
> <CANpSPFTgv+Oz7f97dandmrBBayqf_o9w=18okhcfn0u5dnz...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Dear all,
> The help file for the PCA class is unclear about the preprocessing
> performed on the data.
> You can check on line 410 here:
> https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/decomposition/pca.py#L410
> that the matrix is centered but NOT scaled before performing the
> singular value decomposition.
> However, the help files do not make any mention of it.
> This is unclear for someone who, like me, just wanted to check that
> PCA and np.linalg.svd give the same results. In academic settings,
> students are often asked to compare different methods and to check that
> they yield the same results. I expect that many students have run into
> this problem before...
> Best,
> Ismael Lemhadri
> ------------------------------
>
> Message: 2
> Date: Mon, 16 Oct 2017 15:16:45 +0200
> From: Roman Yurchak <rth.yurc...@gmail.com>
> To: Scikit-learn mailing list <scikit-learn@python.org>
> Subject: Re: [scikit-learn] unclear help file for sklearn.decomposition.pca
> Message-ID: <b2abdcfd-4736-929e-6304-b93832932...@gmail.com>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Ismael,
>
> as far as I saw, sklearn.decomposition.PCA doesn't mention scaling at
> all (except for the whiten parameter, which is post-transformation
> scaling).
>
> So since it doesn't mention it, it makes sense that it doesn't do any
> scaling of the input. Same as np.linalg.svd.
>
> You can verify that PCA and np.linalg.svd yield the same results, with
>
> ```
> >>> import numpy as np
> >>> from sklearn.decomposition import PCA
> >>> import numpy.linalg
> >>> X = np.random.RandomState(42).rand(10, 4)
> >>> n_components = 2
> >>> PCA(n_components, svd_solver='full').fit_transform(X)
> ```
>
> and
>
> ```
> >>> U, s, V = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
> >>> (X - X.mean(axis=0)).dot(V[:n_components].T)
> ```
>
> --
> Roman
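For readers who want an automated check rather than comparing printed arrays, the two snippets above can be combined as in the sketch below. One caveat, stated as an assumption about the comparison rather than something from the message: each principal component is only defined up to its sign, so the columns are sign-aligned before comparing.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(42).rand(10, 4)
n_components = 2

# Scores from scikit-learn's PCA.
scores_pca = PCA(n_components=n_components, svd_solver="full").fit_transform(X)

# Scores from a plain SVD of the centered (not scaled) data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt[:n_components].T

# Align the sign of each column before comparing.
signs = np.sign((scores_pca * scores_svd).sum(axis=0))
same = np.allclose(scores_pca, scores_svd * signs)
print(same)
```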
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 16 Oct 2017 15:27:48 +0200
> From: Serafeim Loukas <seral...@gmail.com>
> To: scikit-learn@python.org
> Subject: [scikit-learn] Question about LDA's coef_ attribute
> Message-ID: <58c6d0da-9de5-4ef5-97c1-48159831f...@gmail.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Dear Scikit-learn community,
>
> Since the documentation of the LDA
> (http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html)
> is not so clear, I would like to ask if the lda.coef_ attribute
> stores the eigenvectors from the SVD decomposition.
>
> Thank you in advance,
> Serafeim
>
> ------------------------------
>
> Message: 4
> Date: Mon, 16 Oct 2017 16:57:52 +0200
> From: Alexandre Gramfort <alexandre.gramf...@inria.fr>
> To: Scikit-learn mailing list <scikit-learn@python.org>
> Subject: Re: [scikit-learn] Question about LDA's coef_ attribute
> Message-ID:
> <cadeotzricoqhuhjmmw2z14cqffeqyndyoxn-ogkavtmq7v0...@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> No, it stores the direction of the decision function to match the
> API of linear models.
>
> HTH
> Alex
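As a rough illustration of what "direction of the decision function" means in practice, the sketch below reproduces `decision_function` from `coef_` and `intercept_`. The choice of the iris dataset and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

# coef_ and intercept_ define one linear score per class, as in other linear models.
manual_scores = X @ lda.coef_.T + lda.intercept_
scores_match = np.allclose(manual_scores, lda.decision_function(X))

print(lda.coef_.shape)  # (n_classes, n_features) for this 3-class problem
print(scores_match)
```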
>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 16 Oct 2017 17:02:46 +0200
> From: Serafeim Loukas <seral...@gmail.com>
> To: Scikit-learn mailing list <scikit-learn@python.org>
> Subject: Re: [scikit-learn] Question about LDA's coef_ attribute
> Message-ID: <413210d2-56ae-41a4-873f-d171bb365...@gmail.com>
> Content-Type: text/plain; charset="us-ascii"
>
> Dear Alex,
>
> Thank you for the prompt response.
>
> Are the eigenvectors stored in some variable?
> Does the lda.scalings_ attribute contain the eigenvectors?
>
> Best,
> Serafeim
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 19, Issue 25
> ********************************************
------------------------------
End of scikit-learn Digest, Vol 19, Issue 28
********************************************
------------------------------
End of scikit-learn Digest, Vol 19, Issue 31
********************************************
--
Sent from a mobile phone and may contain errors
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn