My apologies. That was my first response to the mailing list and apparently I copied the entire thing the first time. Hopefully this works.
Michael could be correct. In fact, I would be very interested in knowing the name of the book he mentioned so I could learn more, and in anything else you uncover!

In the interest of possibly adding something more to the discussion, StratifiedKFold does not return overlapping test folds (i.e., none of the 5 folds created will share any observations).

In:
    temp = StratifiedKFold(range(5)*10, 5).test_folds
    temp
Out:
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
           4, 4, 4, 4, 4, 4, 4, 4, 4, 4])

In:
    for train, test in temp:
        print train, test
Out:
    [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
    [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
    [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
    [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
    [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]

Although the training data overlaps between splits, the accuracy score is calculated on the test folds, which do not overlap. If the test folds contained repeated observations, it would be immediately obvious why a correction method is needed. Since they do not overlap, I cannot immediately see why one would be.

I would appreciate any insight anyone has on the subject.

Best,
Jason

----------------------------------------------------------------------

Date: Fri, 6 Feb 2015 00:00:33 -0500
From: Sebastian Raschka <se.rasc...@gmail.com>
Subject: Re: [Scikit-learn-general] Calculating standard deviation for k-fold cross
To: scikit-learn-general@lists.sourceforge.net
Message-ID: <24cda88f-083e-4365-ae37-519f1f033...@gmail.com>
Content-Type: text/plain; charset=us-ascii

Thanks for all your answers!
Jason, I think you could be right, but the author wrote in the line above the code: "The mean score and the standard deviation of the score estimate are hence given by:". So I assume he literally meant the standard deviation, to show how the scores vary, rather than a measure of how confident we can be in the mean score. Michael's suggestion makes the most sense to me right now, but I have to dig deeper into the literature here...

>>> this is most probably due to the fact that 2 = sqrt(5 - 1), a correction
>>> to variance reduction incurred by the overlapping nature of the folds. the
>>> bootstrap book contains more info on how to calculate these for different
>>> cases of splitting.
>>>
>>> hth,
>>> michael

We have to be a little careful with the term "overlaps" here, though, since it can be confused with sampling "with replacement" like in boosting. So basically, here only the training folds overlap across the different iterations, but the "sqrt(5 - 1)" makes sense.

Thanks for all your help!

Best,
Sebastian
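
The distinction discussed in this thread (disjoint test folds, overlapping training sets) can be checked without scikit-learn. The following is a minimal sketch in plain Python 3, assuming 25 samples split into 5 contiguous folds as in Jason's second output. `contiguous_folds` is a hypothetical helper written for this illustration, not a scikit-learn function, and it ignores stratification entirely. The last line only confirms the arithmetic behind Michael's "2 = sqrt(5 - 1)" remark; it does not establish that this is the right correction.

```python
import math
from itertools import combinations


def contiguous_folds(n_samples, k):
    """Return k (train, test) index lists with contiguous test folds.

    Hypothetical helper for illustration only; assumes n_samples is
    divisible by k and does no stratification.
    """
    fold_size = n_samples // k
    splits = []
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        test_set = set(test)
        train = [j for j in range(n_samples) if j not in test_set]
        splits.append((train, test))
    return splits


splits = contiguous_folds(25, 5)

# Number of shared indices for every pair of test folds.
test_overlap = [len(set(a[1]) & set(b[1])) for a, b in combinations(splits, 2)]

# Number of shared indices for every pair of training sets.
train_overlap = [len(set(a[0]) & set(b[0])) for a, b in combinations(splits, 2)]

print(test_overlap)      # every entry is 0: test folds are disjoint
print(train_overlap)     # every entry is 15: training sets share 3 of 5 folds
print(math.sqrt(5 - 1))  # 2.0
```

Each pair of training sets shares 15 of its 20 samples, so the five fold scores come from models fit on largely the same data even though each score is computed on different test observations. In current scikit-learn releases the splits themselves would come from `StratifiedKFold(n_splits=5).split(X, y)` rather than from the constructor call shown in the thread.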