Hi all, I have found a way to make StratifiedKFold perturb the sample ordering as little as possible by default, so that it does not underestimate the overfitting caused by dependencies between samples:
https://github.com/scikit-learn/scikit-learn/pull/2463

In general, most CV schemes assume that the samples are Independent and Identically Distributed (IID), hence the location of the splits should not significantly impact the estimated validation scores on average. However, it is quite common for the collected samples to be stored in temporal / acquisition order, and in some cases the IID assumption is only very partially met.

This is the case in particular for the digits dataset that we include with scikit-learn: consecutive digits are much more likely to have been written by the same author than a pair of digits picked at random from the dataset.

The current implementation uses an algorithmic trick that reorders the samples (they end up sorted by class label). This breaks the dependencies and thus hides any overfitting problem they would cause.

As StratifiedKFold is the default CV scheme used by `cross_val_score` and `GridSearchCV`, I think it should not shuffle the samples by default, so that those issues can be detected. A caller who is willing to make the IID assumption can always shuffle the data explicitly, for instance by using `StratifiedShuffleSplit`.

Back to the digits dataset: for a non-optimal SVC model trained on the first 1000 digits, the overfitting caused by the dependency can amount to up to 10% of the accuracy score, as highlighted by a new test included in this PR.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
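
P.S. For anyone who wants to see the effect described above, here is a minimal sketch. It is not the test from the PR. It assumes a recent scikit-learn where the splitters live in `sklearn.model_selection`, and the subset size (600), the number of folds (3) and the SVC parameters (`C=10`, `gamma=0.005`) are illustrative choices of mine. It compares contiguous and shuffled folds on the digits samples kept in their stored order.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Keep the samples in their stored (acquisition) order.
# The subset size is an illustrative assumption.
digits = load_digits()
X, y = digits.data[:600], digits.target[:600]

# A deliberately untuned SVC; C and gamma are assumed values.
model = SVC(C=10, gamma=0.005)

splitters = {
    # Contiguous folds: neighbouring samples stay in the same fold.
    "KFold, contiguous": KFold(n_splits=3, shuffle=False),
    # Shuffled folds: neighbouring samples are spread over train and test.
    "KFold, shuffled": KFold(n_splits=3, shuffle=True, random_state=0),
    # Stratified folds without explicit shuffling.
    "StratifiedKFold, unshuffled": StratifiedKFold(n_splits=3),
}

# Mean accuracy over the folds for each splitting strategy.
scores = {
    name: cross_val_score(model, X, y, cv=cv).mean()
    for name, cv in splitters.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

If the samples are indeed dependent, the shuffled estimate should come out noticeably higher than the contiguous one, since shuffling places near-duplicate neighbours on both sides of each split.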
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general