On Fri, Feb 6, 2015 at 10:33 AM, Jason Sanchez <
jason.sanchez.m...@statefarm.com> wrote:
> My apologies. That was my first response to the mailing list and
> apparently I copied the entire thing the first time. Hopefully this works.
>
> Michael could be correct. In fact, I would be very interested in knowing
> the name of the book he mentioned so I could learn more and anything else
> you uncover!
>
I was referring to the Efron/Tibshirani book (
http://books.google.fr/books/about/An_Introduction_to_the_Bootstrap.html?id=gLlpIUxRntoC&redir_esc=y).
I do not have it at hand unfortunately, so I can't look up the relevant
part.
My answer was just an educated guess: I am not sure to what extent these
variance adjustments actually apply in machine learning settings, but
people love to use them nevertheless. At the same time, applying one
cannot hurt, because it only makes your claim more conservative.
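For what it's worth, the kind of adjustment I had in mind is of the same
flavour as the corrected resampled variance of Nadeau & Bengio (2003) -- I am
not certain it is the one the book covers, so take this as an illustrative
sketch only. It inflates the naive variance of the mean score to account for
the correlation induced by overlapping training sets:

```python
import numpy as np

def corrected_variance(scores, n_test, n_train):
    """Naive vs. corrected variance of the mean resampled score.

    The naive estimate treats the J per-fold scores as independent;
    the Nadeau-Bengio correction adds an n_test/n_train term to
    account for the overlap between training sets.
    """
    scores = np.asarray(scores, dtype=float)
    j = len(scores)
    sample_var = scores.var(ddof=1)
    naive = sample_var / j
    corrected = (1.0 / j + n_test / float(n_train)) * sample_var
    return naive, corrected

# hypothetical accuracies from 5 folds on 50 samples (10 test, 40 train)
naive, corrected = corrected_variance([0.81, 0.79, 0.84, 0.80, 0.78],
                                      n_test=10, n_train=40)
# the correction can only inflate the variance estimate
```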
>
> In the interest of possibly adding something more to the discussion,
> StratifiedKFold does not return overlapping folds (i.e. none of the 5
> folds created will share observations).
>
> In: cv = StratifiedKFold(range(5)*10, 5)
> In: cv.test_folds
> Out: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
>        2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4,
>        4, 4, 4, 4, 4, 4, 4])
>
> In: for train, test in cv:
>         print train, test
> Out:
> [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
> [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
> [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
> [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
> [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
>
> Although the training data overlaps between splits, the accuracy score
> is calculated on the test folds, which do not overlap. If the test folds
> did contain repeated observations, it would be immediately obvious why a
> correction method would be needed; since they do not overlap, however, I
> cannot immediately see why one is. I would appreciate any insight anyone
> has on the subject.
>
>
This is in line with Joel's comment. I do not have an answer to this. On
the one hand, yes, you are evaluating on disjoint test sets (but that is
already the case with plain KFold; no need to stratify for that). On the
other hand, the models are learnt on largely shared training data and are
thus correlated. It probably depends on which sources of variance you
are considering.
I too would be very interested in any insight here.
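To make the overlap concrete, here is a quick plain-NumPy check (mimicking
unshuffled KFold splits of 50 samples into 5 folds, without importing
scikit-learn): the test sets are pairwise disjoint, yet any two training
sets still share 3/5 of the data, which is why the per-fold scores are not
independent.

```python
import numpy as np

n, k = 50, 5
indices = np.arange(n)
# contiguous blocks of 10, like KFold without shuffling
test_sets = np.array_split(indices, k)
train_sets = [np.setdiff1d(indices, t) for t in test_sets]

# largest pairwise intersection between test sets: 0 (disjoint)
test_overlap = max(len(np.intersect1d(a, b))
                   for i, a in enumerate(test_sets)
                   for b in test_sets[i + 1:])

# smallest pairwise intersection between training sets:
# 50 - 10 - 10 = 30 shared samples out of 40
train_overlap = min(len(np.intersect1d(a, b))
                    for i, a in enumerate(train_sets)
                    for b in train_sets[i + 1:])
```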
Michael
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general