My apologies. That was my first response to the mailing list and apparently I 
copied the entire thing the first time. Hopefully this works.

Michael could be correct. In fact, I would be very interested in knowing the 
name of the book he mentioned so I could learn more, as well as anything else 
you uncover!

In the interest of possibly adding something more to the discussion, 
StratifiedKFold does not return overlapping test folds (i.e. no two of the 5 
folds created will share any observations).

In: temp = StratifiedKFold(range(5)*10, 5).test_folds
      temp
Out: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4])

In: for train, test in temp:
        print train, test
Out:
[ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
[ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
[ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]

Although the train data is overlapping between splits, the accuracy score is 
calculated on the test folds (which do not overlap). If the test folds did have 
repeated observations, then it would be immediately obvious why a correction 
method would be needed; however, if they do not overlap, I cannot immediately 
understand why a correction method would be needed. I would appreciate any 
insight anyone has on the subject.
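
To make the distinction concrete, here is a minimal sketch in plain Python 3. 
It does not use scikit-learn: kfold_indices is a hypothetical helper that 
produces plain contiguous (unstratified) folds, assumed here only as a 
stand-in that reproduces the 25-observation splits printed above. It checks 
that the test folds are disjoint while the training sets share observations.

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k contiguous, equally sized folds.

    Hypothetical helper for illustration; assumes n is divisible by k.
    """
    fold_size = n // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        test_set = set(test)
        train = [j for j in range(n) if j not in test_set]
        yield train, test


splits = list(kfold_indices(25, 5))

# Every observation appears in exactly one test fold (no overlap).
all_test = [j for _, test in splits for j in test]
test_folds_disjoint = len(all_test) == len(set(all_test)) == 25

# Training sets from two different splits do share observations.
shared_train = len(set(splits[0][0]) & set(splits[1][0]))

print(test_folds_disjoint)
print(shared_train)
```

With 5 folds of 5 observations each, any two training sets of 20 share 15 
observations, so the 5 fitted models are not independent even though the 
test folds are.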

Best,
Jason

----------------------------------------------------------------------

Date: Fri, 6 Feb 2015 00:00:33 -0500
From: Sebastian Raschka <se.rasc...@gmail.com>
Subject: Re: [Scikit-learn-general] Calculating standard deviation for
        k-fold cross
To: scikit-learn-general@lists.sourceforge.net
Message-ID: <24cda88f-083e-4365-ae37-519f1f033...@gmail.com>
Content-Type: text/plain; charset=us-ascii

Thanks for all your answers!
Jason, I think you could be right, but the author wrote in the line above the 
code

 "The mean score and the standard deviation of the score estimate are hence 
given by:"

So I assume he literally meant the standard deviation, to show how the scores 
vary, rather than a measure of how confident we can be in the mean score.

Michael's suggestion makes the most sense to me right now, but I have to dig 
deeper into the literature here...

>>> this is most probably due to the fact that 2 = sqrt(5 - 1), a correction
>>> to variance reduction incurred by the overlapping nature of the folds. the
>>> bootstrap book contains more info on how to calculate these for different
>>> cases of splitting.
>>> 
>>> hth,
>>> michael

We do have to be a little careful with the word "overlaps" here, though, since 
it can be confused with sampling "with replacement" as in bootstrapping. Here, 
only the training folds overlap across the different iterations, but the 
"sqrt(5 - 1)" makes sense.
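
As a small numerical aside, here is a sketch of the two quantities being 
discussed, using only the Python 3 standard library. The scores list is made 
up for illustration and is not taken from the book or from any real run. It 
also checks one identity that may help relate the two readings: dividing the 
population standard deviation by sqrt(k - 1) gives the same number as the 
usual standard error of the mean (sample standard deviation over sqrt(k)).

```python
import math
import statistics

# Hypothetical accuracies from a 5-fold cross-validation run.
scores = [0.90, 0.92, 0.88, 0.91, 0.89]
k = len(scores)

mean_score = statistics.mean(scores)

# Population standard deviation (divides by k): how the fold scores vary.
pop_std = statistics.pstdev(scores)

# The "divide by 2" version for k = 5, since sqrt(5 - 1) == 2.
corrected = pop_std / math.sqrt(k - 1)

# Standard error of the mean: sample std (divides by k - 1) over sqrt(k).
sem = statistics.stdev(scores) / math.sqrt(k)

print(mean_score, pop_std, corrected, sem)
```

Whether that standard error is a good estimate is a separate question, since 
the fold scores come from models trained on overlapping data.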

Thanks for all your help!

Best,
Sebastian

