Re: [Scikit-learn-general] Calculating standard deviation for k-fold cross

Sebastian Raschka Thu, 05 Feb 2015 21:01:25 -0800

Thanks for all your answers!
Jason, I think you could be right, but in the line above the code the author 
wrote:

 "The mean score and the standard deviation of the score estimate are hence 
given by:"

So I assume he literally meant the standard deviation, to show how the scores 
vary, rather than how confident we can be in the mean score.
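
To make the distinction concrete, here is a minimal sketch that computes both 
quantities for one set of fold scores. The scores below are hypothetical values 
chosen for illustration, not the exact numbers behind the truncated array in 
the documentation, and the "naive SEM" treats the folds as independent, which 
is only an assumption.

```python
import math
import statistics

# Hypothetical per-fold accuracies, chosen only for illustration; they are
# not the exact values behind the truncated array in the quoted example.
scores = [0.97, 1.00, 0.97, 0.97, 1.00]
k = len(scores)

mean = statistics.fmean(scores)

# Population standard deviation (ddof=0), which is what NumPy's
# scores.std() computes by default: the spread of the fold scores.
spread = statistics.pstdev(scores)

# Naive standard error of the mean: treats the k fold scores as
# independent, which they are not, since the training sets overlap.
naive_sem = statistics.stdev(scores) / math.sqrt(k)

print("mean          : %.3f" % mean)
print("std of scores : %.3f" % spread)
print("2 * std       : %.3f" % (2 * spread))
print("naive SEM     : %.3f" % naive_sem)
```

The first two numbers describe how much the folds differ from each other; the 
last one is a (probably optimistic) statement about the mean itself.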

Michael's suggestion makes the most sense to me right now, but I have to dig 
deeper into the literature here...

>>> this is most probably due to the fact that 2 = sqrt(5 - 1), a correction
>>> to variance reduction incurred by the overlapping nature of the folds. the
>>> bootstrap book contains more info on how to calculate these for different
>>> cases of splitting.
>>> 
>>> hth,
>>> michael

We have to be a little careful with the word "overlaps" here, though, since it 
can be confused with sampling "with replacement", as in boosting. Basically, 
only the folds overlap across the different iterations here, but the 
"sqrt(5 - 1)" makes sense.
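
A small sketch of what overlaps and what does not in 5-fold cross-validation. 
The helper `kfold_indices` is hypothetical and uses plain contiguous folds; it 
is not scikit-learn's own splitter (which stratifies by class for classifiers), 
so treat it as an illustration of the idea only.

```python
import math


def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size


n_samples, k = 150, 5  # 150 samples as in iris; k matches cv=5
splits = list(kfold_indices(n_samples, k))

# The test folds are disjoint and together cover every sample exactly once.
all_test = [i for _, test in splits for i in test]
assert sorted(all_test) == list(range(n_samples))

# Any two training sets share (k - 2) / (k - 1) of their samples.
train_a, train_b = set(splits[0][0]), set(splits[1][0])
shared = len(train_a & train_b) / len(train_a)
print("shared fraction of two training sets: %.2f" % shared)

# The factor from Michael's suggestion: sqrt(k - 1) equals 2 only for k = 5.
print("sqrt(k - 1) = %.1f" % math.sqrt(k - 1))
```

So nothing is drawn "with replacement": each sample is tested exactly once, and 
the dependence between folds comes from the shared training data.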

Thanks for all your help!

Best,
Sebastian


> On Feb 5, 2015, at 11:32 PM, Jason Sanchez <[email protected]> 
> wrote:
> 
> This is a very common calculation; you will find it in all of these places 
> (but only with one standard deviation):
> http://scikit-learn.org/stable/auto_examples/randomized_search.html
> http://nbviewer.ipython.org/github/gmonce/scikit-learn-book/blob/master/Chapter%202%20-%20Supervised%20Learning%20-%20Image%20Recognition%20with%20Support%20Vector%20Machines.ipynb
> http://youtu.be/iFkRt3BCctg?t=33m25s
> 
> I would presume that the standard deviation is multiplied by two because the 
> author of the example wanted to create confidence intervals based on two 
> standard deviations. Technically, if they multiplied it by 1.96, then they 
> would approximate the famous 95% confidence interval better, but 2 standard 
> deviations is often used for simplicity.
> 
> http://en.wikipedia.org/wiki/1.96
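
For reference, a quick check of the 2 versus 1.96 point, assuming the quantity 
of interest is normally distributed (an assumption; five fold scores are far 
too few to verify it). The helper `two_sided_coverage` is a name made up for 
this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def two_sided_coverage(z):
    """Probability that a standard normal variable falls within +/- z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


# Compare the coverage of one, 1.96 and two standard deviations.
for z in (1.0, 1.96, 2.0):
    print("+/- %.2f sd covers %.4f" % (z, two_sided_coverage(z)))

# The multiplier for an exact 95% two-sided interval under normality.
z_95 = standard_normal.inv_cdf(0.975)
print("z for 95%%: %.4f" % z_95)
```

Under that assumption, 2 standard deviations gives slightly more than 95% 
coverage, so the rounding is harmless.
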
> 
> Best,
> Jason
> 
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] 
> Sent: Thursday, February 05, 2015 4:11 PM
> To: [email protected]
> Subject: Scikit-learn-general Digest, Vol 61, Issue 8
> 
> Send Scikit-learn-general mailing list submissions to
>       [email protected]
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>       https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> or, via email, send a message with subject or body 'help' to
>       [email protected]
> 
> You can reach the person managing the list at
>       [email protected]
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Scikit-learn-general digest..."
> 
> 
> Today's Topics:
> 
>   1. Re: Calculating standard deviation for k-fold cross
>      validation estimate (Michael Eickenberg)
>   2. Re: GSoC2015 topics (Joel Nothman)
>   3. Re: Calculating standard deviation for k-fold cross
>      validation estimate (Joel Nothman)
>   4. Re: Calculating standard deviation for k-fold cross
>      validation estimate (Kyle Kastner)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Thu, 5 Feb 2015 20:44:16 +0100
> From: Michael Eickenberg <[email protected]>
> Subject: Re: [Scikit-learn-general] Calculating standard deviation for
>       k-fold cross validation estimate
> To: "[email protected]"
>       <[email protected]>
> Message-ID:
>       <cadxjn660qzvxs+ui+cskezdzwskqh_9gagtl-opqwivha-g...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> this is most probably due to the fact that 2 = sqrt(5 - 1), a correction to
> variance reduction incurred by the overlapping nature of the folds. the
> bootstrap book contains more info on how to calculate these for different
> cases of splitting.
> 
> hth,
> michael
> 
> On Thursday, February 5, 2015, Sebastian Raschka <[email protected]>
> wrote:
> 
>> Hi,
>> 
>> I am wondering why the standard deviation of the accuracy estimate is
>> multiplied by 2 in the example on
>> http://scikit-learn.org/stable/modules/cross_validation.html; it would be
>> nice if someone could explain it to me.
>> 
>> The relevant excerpt from the page linked above:
>> 
>> >>> clf = svm.SVC(kernel='linear', C=1)
>> >>> scores = cross_validation.cross_val_score(
>> ...     clf, iris.data, iris.target, cv=5)
>> ...
>> >>> scores
>> array([ 0.96..., 1. ..., 0.96..., 0.96..., 1. ])
>> 
>> The mean score and the standard deviation of the score estimate are hence
>> given by:
>> 
>> >>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
>> Accuracy: 0.98 (+/- 0.03)
>> 
>> 
>> Best,
>> Sebastian
>> 
>> ------------------------------------------------------------------------------
>> Dive into the World of Parallel Programming. The Go Parallel Website,
>> sponsored by Intel and developed in partnership with Slashdot Media, is
>> your
>> hub for all things parallel software development, from weekly thought
>> leadership blogs to news, videos, case studies, tutorials and more. Take a
>> look and join the conversation now. http://goparallel.sourceforge.net/
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected] <javascript:;>
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> 
> 
> ------------------------------
> 
> Message: 2
> Date: Fri, 6 Feb 2015 08:52:31 +1100
> From: Joel Nothman <[email protected]>
> Subject: Re: [Scikit-learn-general] GSoC2015 topics
> To: scikit-learn-general <[email protected]>
> Message-ID:
>       <caakaflw8xun0yp_wgwn-x8wgwbtybqvny38-pqeej8e2hd-...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
>> I think adding partial_fit functions in general to as many algorithms as
>> possible would be nice
> 
> Which could be a project in itself, for someone open to breadth rather than
> depth.
> 
> On 6 February 2015 at 06:43, Kyle Kastner <[email protected]> wrote:
> 
>> IncrementalPCA is done (have to add randomized SVD solver but that should
>> be simple), but I am sure there are other low rank methods which need a
>> partial_fit. I think adding partial_fit functions in general to as many
>> algorithms as possible would be nice
>> 
>> Kyle
>> 
>> On Thu, Feb 5, 2015 at 2:12 PM, Akshay Narasimha <[email protected]
>>> wrote:
>> 
>>> Is online low rank factorisation still a valid idea for this year? It
>>> was on last year's idea list.
>>> 
>>> On Thu, Feb 5, 2015 at 9:49 PM, Alexandre Gramfort <
>>> [email protected]> wrote:
>>> 
>>>>> I just looked at the list from last year, and what seems most relevant
>>>>> still is GMMs, and possibly the coordinate descent solvers (Alex maybe
>>>>> you can say what is left there or if with the SAG we are happy now?)
>>>> 
>>>> there is work coming in coordinate descent and SAG is almost done.
>>>> I don't think it's worth investing a gsoc on this topic.
>>>> 
>>>> Alex
>>>> 
>>>> 
> 
> ------------------------------
> 
> Message: 3
> Date: Fri, 6 Feb 2015 08:54:12 +1100
> From: Joel Nothman <[email protected]>
> Subject: Re: [Scikit-learn-general] Calculating standard deviation for
>       k-fold cross validation estimate
> To: scikit-learn-general <[email protected]>
> Message-ID:
>       <caakaflvu_-3krs31cunfbu3rd1sowtqun33rvqnyzvirrnl...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> With cv=5, only the training sets should overlap. Is this adjustment still
> appropriate?
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Thu, 5 Feb 2015 17:11:00 -0500
> From: Kyle Kastner <[email protected]>
> Subject: Re: [Scikit-learn-general] Calculating standard deviation for
>       k-fold cross validation estimate
> To: [email protected]
> Message-ID:
>       <CAGNZ19BYpHQS1zrKLAShgGEF=echmkw5erwwulxodm6pp57...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Could it also be accounting for the +/-? Standard deviation is one-sided, right?
> 
> 
> ------------------------------
> 
> 
> 
> End of Scikit-learn-general Digest, Vol 61, Issue 8
> ***************************************************
> 


