2011/12/30 Gilles Louppe <[email protected]>:
>> It seems to be an interesting tool to me. We need to find a
>> non-trivial overfitting example that would run in an acceptable time
>> with the datasets available in the scikit.
>
> Actually, those curves can be plotted with respect to any parameter,
> not only the training set size.
>
> What comes to mind is to use a decision tree and to plot the training
> and test curves with respect to max_depth or min_split (this is
> actually what I make my students do ;)). With min_split=1, for
> instance, you will get a fully developed tree with a perfect score on
> the training set (because of overfitting) but quite bad accuracy on
> the test set. As you increase min_split, the error on the test set
> will decrease (because the tree will no longer fit the noise, i.e.,
> it will have lower variance), reach an optimum, and then increase
> again (because the tree will become too simple, i.e., too biased).
>
> You can do the same with any model (SVM wrt C, linear model wrt the
> regularization factor, etc).
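
A minimal sketch of the experiment described in the quoted message, assuming a recent scikit-learn release in which the parameter is called min_samples_split (its smallest integer value is 2, which plays the role of min_split=1 above). The synthetic dataset, the 20% label noise, the 50/50 split, and the helper name tree_scores are illustrative assumptions, not part of the original discussion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def tree_scores(min_samples_split_values, random_state=0):
    """Return (min_samples_split, train accuracy, test accuracy) tuples."""
    # Synthetic data with 20% randomly flipped labels, so that a fully
    # developed tree has noise to overfit (an assumption for illustration).
    X, y = make_classification(
        n_samples=1000,
        n_features=20,
        n_informative=5,
        flip_y=0.2,
        random_state=random_state,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=random_state
    )
    rows = []
    for m in min_samples_split_values:
        clf = DecisionTreeClassifier(min_samples_split=m, random_state=random_state)
        clf.fit(X_train, y_train)
        rows.append((m, clf.score(X_train, y_train), clf.score(X_test, y_test)))
    return rows


rows = tree_scores([2, 5, 10, 20, 50, 100, 200])
for m, train_acc, test_acc in rows:
    print(f"min_samples_split={m:3d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Plotting the second and third columns against the first gives the two curves discussed above. Where the test optimum falls depends on the data and the noise level.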
Yes, this is the traditional model selection curve, e.g. for regularized
linear regression:

http://scikit-learn.org/dev/auto_examples/linear_model/plot_lasso_model_selection.html

What I find interesting about the training-set-size curves is that they
give a hint as to whether adding more labeled data will help or not.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
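
For the training-set-size curves mentioned in the reply, here is a rough sketch using the learning_curve helper from sklearn.model_selection, which exists in later scikit-learn releases and postdates this thread. The digits dataset, the GaussianNB estimator, and the five training sizes are assumptions chosen only so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)

# Fit the estimator on growing fractions of the training folds and score it
# on both the data it was fitted on and the held-out fold (5-fold CV).
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Average over the cross-validation folds for each training set size.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, mean_train, mean_test):
    print(f"n_train={n:5d}  train={tr:.3f}  test={te:.3f}")
```

If the test score is still rising at the largest training size, more labeled data is likely to help. If the two curves have already flattened close to each other, it probably will not.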
