Re: [Scikit-learn-general] Calculating standard deviation for k-fold cross validation estimate

Sebastian Raschka Fri, 06 Feb 2015 05:39:07 -0800

Hi, Jason,

just wanted to clarify what I meant: Basically, I was trying to describe the
overlap between the "rows" in this case. E.g., the row with index "0" would
appear in 4 of the 5 training sets across the different iterations. In k-fold
and stratified k-fold, the test folds are always disjoint across iterations.
"Sampling with replacement," by contrast, would be called "bootstrapping."
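
To make the distinction concrete, here is a minimal sketch in plain Python (no
scikit-learn needed). The helper kfold_indices is a hypothetical name I made up
for illustration, not a library function, and the 25 rows / 5 folds just mirror
the sizes in Jason's output below. It counts how often each row lands in a
training set versus a test fold, and then draws one bootstrap sample for
comparison.

```python
import random
from collections import Counter


def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for plain, unshuffled k-fold.

    Hypothetical helper for illustration only, not a scikit-learn API.
    """
    fold_size, remainder = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


n_samples, n_folds = 25, 5
train_counts, test_counts = Counter(), Counter()
for train, test in kfold_indices(n_samples, n_folds):
    train_counts.update(train)
    test_counts.update(test)

# Row 0 is tested once and used for training in the other k - 1 iterations.
print(test_counts[0], train_counts[0])  # prints: 1 4

# Bootstrap, by contrast: draw n_samples rows *with replacement*, so a row
# may appear several times (or not at all) within a single sample.
rng = random.Random(0)
boot = [rng.randrange(n_samples) for _ in range(n_samples)]
boot_counts = Counter(boot)
print(sorted(boot_counts.items()))
```

So the "overlap" in k-fold is only between training sets of different
iterations; within one iteration, and across test folds, nothing is repeated.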


Anyway, I am still curious about the multiplication, since k-fold cross
validation already has a large variance/bias ratio compared to the bootstrap.
Hopefully, the original author stumbles upon this Q & A :)
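
For reference, a small sketch of the arithmetic in question. The fold scores
are made-up numbers, and I am assuming the factor under discussion is 2 with
k = 5 folds, as Michael's "2 = sqrt(5 - 1)" suggests. It does not settle
whether the factor is meant as a correction or as a rough confidence band; it
only shows that the two readings give the same number when k = 5.

```python
import math
import statistics

# Hypothetical accuracy scores from 5 folds (made up for illustration).
scores = [0.96, 0.98, 0.94, 0.97, 0.95]
k = len(scores)

mean = statistics.mean(scores)
# Population standard deviation (divides by k), i.e. how the fold scores vary.
std = statistics.pstdev(scores)
# Naive standard error of the mean; this treats the folds as independent,
# which is an assumption, since their training sets overlap.
sem = std / math.sqrt(k)

print("mean             = %.4f" % mean)
print("std              = %.4f" % std)
print("naive sem        = %.4f" % sem)
print("std * 2          = %.4f" % (std * 2))
print("std * sqrt(k-1)  = %.4f" % (std * math.sqrt(k - 1)))
```

With any other number of folds the last two lines would differ, which might be
one way to find out what the author intended.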

Best,
Sebastian

> On Feb 6, 2015, at 4:33 AM, Jason Sanchez <jason.sanchez.m...@statefarm.com> 
> wrote:
> 
> My apologies. That was my first response to the mailing list and apparently I 
> copied the entire thing the first time. Hopefully this works.
> 
> Michael could be correct. In fact, I would be very interested in knowing the 
> name of the book he mentioned so I could learn more and anything else you 
> uncover!
> 
> In the interest of possibly adding something more to the discussion,
> StratifiedKFold does not return overlapping test folds (i.e., none of the 5
> folds created will share observations).
> 
> In: temp = StratifiedKFold(range(5)*10, 5).test_folds
>      temp
> Out: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
>       2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
>       4, 4, 4, 4])
> 
> In: for train, test in temp:
>       print train, test
> Out:
> [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
> [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
> [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
> [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
> [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
> 
> Although the train data is overlapping between splits, the accuracy score is 
> calculated on the test folds (which do not overlap). If the test folds did 
> have repeated observations, then it would be immediately obvious why a 
> correction method would be needed; however, if they do not overlap, I cannot 
> immediately understand why a correction method would be needed. I would 
> appreciate any insight anyone has on the subject.
> 
> Best,
> Jason
> 
> ----------------------------------------------------------------------
> 
> Date: Fri, 6 Feb 2015 00:00:33 -0500
> From: Sebastian Raschka <se.rasc...@gmail.com>
> Subject: Re: [Scikit-learn-general] Calculating standard deviation    for
>       k-fold cross
> To: scikit-learn-general@lists.sourceforge.net
> Message-ID: <24cda88f-083e-4365-ae37-519f1f033...@gmail.com>
> Content-Type: text/plain; charset=us-ascii
> 
> Thanks for all your answers!
> Jason, I think you could be right, but the author wrote in the line above the 
> code
> 
> "The mean score and the standard deviation of the score estimate are hence 
> given by:"
> 
> So I assume he literally meant standard deviation to show how the scores
> vary rather than how confident we can be in the mean score.
> 
> Michael's suggestion makes most sense to me right now, but I have to dig 
> deeper into the literature here...
> 
>>>> this is most probably due to the fact that 2 = sqrt(5 - 1), a correction
>>>> to variance reduction incurred by the overlapping nature of the folds. the
>>>> bootstrap book contains more info on how to calculate these for different
>>>> cases of splitting.
>>>> 
>>>> hth,
>>>> michael
> 
> Although we have to be a little bit careful with the "overlaps" here since it
> can be confused with sampling "with replacement" as in bootstrapping. So,
> basically, here only the training folds overlap across the different
> iterations, but the "sqrt(5 - 1)" makes sense.
> 
> Thanks for all your help!
> 
> Best,
> Sebastian
> 
> 
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming. The Go Parallel Website,
> sponsored by Intel and developed in partnership with Slashdot Media, is your
> hub for all things parallel software development, from weekly thought
> leadership blogs to news, videos, case studies, tutorials and more. Take a
> look and join the conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


