On Sun, Nov 06, 2011 at 01:21:23PM -0500, Satrajit Ghosh wrote:
> what would be the theoretical or practical implications of computing the cv
> score by accumulating all test outcomes into a long vector (option 1) vs
> averaging per fold (option 2), especially when N's are small.
If I understand your question correctly, it basically amounts to how you
weight the different folds, and whether the error (or its variance, or
other scalings of it) is computed per fold or on the concatenated test
data.

We used to have a parameter 'iid' in GridSearch and cross_val_score that
implemented option 1. We got rid of it because it complicated the code and
was fragile.

As far as the theoretical implications go, if your test sets are not too
small and your data are iid, the two options should be asymptotically
equivalent. I do not know about non-asymptotic results.

As an aside, this question does underline the importance of having test
sets that are not too small: very small test sets lead to poor estimates
of the generalization error. It is also better if the test sets are
balanced, i.e. if they contain a good representation of the different
possible predictions. In general, I find that the leave-one-out strategy
should be avoided. A better strategy is something like a ShuffleSplit,
with a left-out fraction between .1 and .2, and as many folds as you are
patient enough to wait for.

HTH,

Gael
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
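As a follow-up illustration (mine, not part of the original thread): the two options can be sketched with scikit-learn's modern API. Names like `pooled_score` and `per_fold_score` are my own; with equal-sized folds and a metric like accuracy the two numbers coincide, and they diverge only when fold sizes or fold difficulty differ.

```python
# Sketch comparing option 1 (pool all test outcomes, score once) with
# option 2 (score each fold, then average), on a toy dataset.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=7, shuffle=True, random_state=0)

all_true, all_pred = [], []   # option 1: accumulate every test outcome
fold_scores = []              # option 2: one score per fold

for train_idx, test_idx in cv.split(X):
    clf.fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    all_true.append(y[test_idx])
    all_pred.append(pred)
    fold_scores.append(accuracy_score(y[test_idx], pred))

# Option 1: one score over the concatenated predictions.
pooled_score = accuracy_score(np.concatenate(all_true),
                              np.concatenate(all_pred))
# Option 2: unweighted mean of the per-fold scores.
per_fold_score = np.mean(fold_scores)

# With 150 samples and 7 folds the fold sizes differ (22 vs 21), so the
# pooled score and the unweighted fold average can disagree slightly.
print(pooled_score, per_fold_score)
```

For accuracy, option 1 is exactly the fold-size-weighted average of the per-fold scores, which is why the gap shrinks as fold sizes equalize; for non-decomposable metrics (e.g. AUC) the two can differ more substantially.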
