On Sun, Nov 06, 2011 at 01:21:23PM -0500, Satrajit Ghosh wrote:
> what would be the theoretical or practical implications of computing the cv
> score by accumulating all test outcomes into a long vector (option 1) vs
> averaging per fold (option 2), especially when N's are small.

If I understand your question well, it basically amounts to how you
weight the different folds, and whether variance, or other scalings of
error, are computed per fold or on the total concatenated data.
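To make the difference concrete, here is a small numpy sketch (toy fold
outcomes invented for illustration): with unequal fold sizes, pooling the
test outcomes weights every sample equally, while averaging per fold
weights every fold equally, and the two scores differ.

```python
import numpy as np

# Toy binary outcomes: 1 = correct prediction, 0 = error.
# Folds of unequal size, to make the weighting difference visible.
fold_outcomes = [
    np.array([1, 1, 0]),           # 3 test samples, accuracy 2/3
    np.array([1, 0]),              # 2 test samples, accuracy 1/2
    np.array([1, 1, 1, 1, 0]),     # 5 test samples, accuracy 4/5
]

# Option 1: concatenate all test outcomes, score once.
pooled = np.concatenate(fold_outcomes)
score_pooled = pooled.mean()       # each sample weighted equally -> 0.7

# Option 2: score each fold, then average the fold scores.
score_per_fold = np.mean([f.mean() for f in fold_outcomes])
# each fold weighted equally -> (2/3 + 1/2 + 4/5) / 3 ~= 0.656

print(score_pooled, score_per_fold)
```

With equal-sized folds the two options coincide exactly; the gap only
appears when fold sizes (or fold difficulties) vary.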

We used to have an 'iid' parameter in GridSearch and cross_val_score that
implemented option 1. We got rid of it, because it made the code
complicated and was fragile.

As far as the theoretical implications go, if your testing sets are not
too small and your data are iid, the two options should be asymptotically
equivalent. I do not know about non-asymptotic results.

As an aside, this question does reveal the importance of having test sets
that are not too small. Very small test sets lead to poor estimation of
the generalization error. It is also better if the test sets are
balanced: they should contain a good representation of the different
possible predictions. In general, I find that the leave-one-out strategy
should be avoided. The best strategy would be something like a
ShuffleSplit, with a left-out fraction between .1 and .2, and as many
folds as you are patient enough to wait for.
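The ShuffleSplit idea above can be sketched in a few lines of numpy.
This is a minimal stand-in for scikit-learn's ShuffleSplit, not the
library's exact API; the function and parameter names here are
illustrative only.

```python
import numpy as np

def shuffle_split(n_samples, n_iter=10, test_fraction=0.2, seed=0):
    """Yield (train, test) index arrays, re-shuffling the data each
    iteration.

    Minimal sketch of the ShuffleSplit idea: each iteration holds out a
    fresh random `test_fraction` of the samples, so you can run as many
    iterations as you are patient enough to wait for.
    """
    rng = np.random.RandomState(seed)
    n_test = int(np.floor(test_fraction * n_samples))
    for _ in range(n_iter):
        perm = rng.permutation(n_samples)  # new shuffle each iteration
        yield perm[n_test:], perm[:n_test]

for train, test in shuffle_split(n_samples=50, n_iter=3, test_fraction=0.2):
    print(len(train), len(test))   # 40 train / 10 test each iteration
```

Unlike leave-one-out, every test set here is large enough to give a
stable per-iteration score, and the number of iterations is decoupled
from the dataset size.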

HTH,

Gael

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
