> It seems to be an interesting tool to me. We need to find a
> non-trivial overfitting example that would run in an acceptable time
> with the datasets available in the scikit.
Actually, those curves can be plotted with respect to any parameter, not only the training set size. What comes to mind is to use a decision tree and plot the training and test curves with respect to max_depth or min_split (this is actually what I make my students do ;)). With min_split=1, for instance, you get a fully developed tree with a perfect score on the training set (because of overfitting) but quite bad accuracy on the test set. As you increase min_split, the error on the test set will decrease (because the tree no longer fits the noise, i.e., its variance goes down), reach an optimum, and then increase again (because the tree becomes too simple, i.e., too biased).

You can do the same with any model (SVM wrt C, linear model wrt the regularization factor, etc.).

Gilles
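
For illustration, here is a minimal sketch of the decision-tree experiment described in this message. It rests on several assumptions that are not in the original: a recent scikit-learn, where the parameter is called min_samples_split and its smallest integer value is 2 (rather than min_split=1), a synthetic noisy dataset from make_classification instead of one of the bundled datasets, and an arbitrary grid of values. It prints the scores rather than plotting them; the exact numbers depend on the data and the library version.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, chosen only for illustration (an assumption, not from the thread).
# flip_y=0.2 assigns a random class to about 20% of the samples, i.e. label noise.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Hypothetical grid; 2 is the smallest integer accepted by current scikit-learn
# and yields a fully developed tree.
min_split_values = [2, 5, 10, 20, 50, 100, 200, 400]

train_scores, test_scores = [], []
for m in min_split_values:
    tree = DecisionTreeClassifier(min_samples_split=m, random_state=0)
    tree.fit(X_train, y_train)
    # score() returns accuracy; the error discussed above is 1 - accuracy.
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))

for m, tr, te in zip(min_split_values, train_scores, test_scores):
    print(f"min_samples_split={m:4d}  train={tr:.3f}  test={te:.3f}")
```

Plotting train_scores and test_scores against min_split_values (for example with matplotlib) gives the two curves. The same loop works for max_depth, or for another estimator and parameter such as an SVM and C.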
