thanks very much gael. unfortunately, even 5-fold cross-validation still
leaves a pretty small test set, because N is pretty small. i'm actually
using a StratifiedKFold with as large a test set as i can get without
blowing the variance of the model through the roof.
cheers,
satra
On Sun, Nov 6, 2011 at 3:15 PM, Gael Varoquaux <
[email protected]> wrote:
> On Sun, Nov 06, 2011 at 01:21:23PM -0500, Satrajit Ghosh wrote:
> > what would be the theoretical or practical implications of computing the
> > cv score by accumulating all test outcomes into a long vector (option 1) vs
> > averaging per fold (option 2), especially when N's are small.
>
> If I understand your question well, it basically amounts to how you
> weight the different folds, and whether variance, or other scalings of
> error, are computed per fold or on the total concatenated data.
>
> We used to have a parameter 'iid' in GridSearch and cross_val_score that
> implemented option 1. We got rid of it because it made the code
> complicated and was fragile.
>
> As far as the theoretical implications go, if your testing sets are not
> too small, and your data are iid, the two should be asymptotically
> equivalent. I do not know about the non-asymptotic results.
>
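To make the two options concrete, here is a minimal sketch in plain Python. The per-fold labels and predictions are made up for illustration, and `accuracy` is a hypothetical helper, not a scikit-learn function. It only shows that the two scores differ once the folds have unequal sizes (for accuracy they coincide when all folds are the same size).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up (y_true, y_pred) pairs for three test folds of unequal size.
folds = [
    ([1, 0, 1, 0], [1, 0, 1, 1]),  # 3 of 4 correct
    ([1, 0, 1, 0], [1, 0, 0, 0]),  # 3 of 4 correct
    ([1, 0], [0, 0]),              # 1 of 2 correct
]

# Option 2: score each fold separately, then average the fold scores.
per_fold = [accuracy(t, p) for t, p in folds]
mean_of_folds = sum(per_fold) / len(per_fold)

# Option 1: concatenate all test outcomes into one long vector, score once.
pooled_true = [t for ts, _ in folds for t in ts]
pooled_pred = [p for _, ps in folds for p in ps]
pooled = accuracy(pooled_true, pooled_pred)

print("per-fold scores:", per_fold)
print("mean of folds  :", mean_of_folds)
print("pooled score   :", pooled)
```

Option 2 gives every fold the same weight, so the small third fold counts as much as the larger ones. Option 1 gives every sample the same weight.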
> As an aside, this question does reveal the importance of having test sets
> that are not too small. Very small test sets lead to poor estimation of
> the generalization error. Also, it is better if the test sets are
> balanced, i.e. they contain a good representation of the different possible
> predictions. In general, I find that the leave-one-out strategy should be
> avoided. The best strategy would be something like a ShuffleSplit, with a
> left-out fraction between .1 and .2, and as many folds as you are patient
> enough to wait for.
>
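The repeated random-split idea can be sketched in plain Python as below. This is a hand-rolled illustration, not scikit-learn's ShuffleSplit implementation: the function name `shuffle_split_indices`, its parameters, and the rounding rule for the test-set size are all assumptions made for the example.

```python
import random

def shuffle_split_indices(n_samples, n_splits=10, test_fraction=0.2, seed=0):
    """Yield (train, test) index lists from independent random permutations.

    Hypothetical helper: each split reshuffles all indices and holds out
    roughly `test_fraction` of them, so test sets may overlap across splits.
    """
    rng = random.Random(seed)
    # Assumed rounding rule; always keep at least one test sample.
    n_test = max(1, round(test_fraction * n_samples))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]

splits = list(shuffle_split_indices(n_samples=20, n_splits=5, test_fraction=0.2))
for train, test in splits:
    print(len(train), "train /", len(test), "test:", sorted(test))
```

Unlike K-fold, the number of splits is independent of the left-out fraction, so it can be raised as far as compute time allows.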
> HTH,
>
> Gael
>
>
> ------------------------------------------------------------------------------
> RSA(R) Conference 2012
> Save $700 by Nov 18
> Register now
> http://p.sf.net/sfu/rsa-sfdev2dev1
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>