Hi all,

I have found a way to make StratifiedKFold preserve the original
sample ordering as much as possible by default, so that it does not
underestimate the overfitting caused by dependencies between samples:

  https://github.com/scikit-learn/scikit-learn/pull/2463

In general, most CV schemes assume that the samples are Independent
and Identically Distributed (IID), in which case the location of the
splits should not significantly impact the estimated validation
scores on average.

However, it is quite common for the collected samples to be stored in
temporal / acquisition order, and in some cases the IID assumption is
only very partially met. This is the case in particular for the digits
dataset that we include with scikit-learn (consecutive digits are much
more likely to have been written by the same author than a pair of
digits picked at random in the dataset).

The current implementation relies on an algorithmic trick that
effectively shuffles the samples (they are sorted by class label).
This breaks the dependencies between consecutive samples and thus
hides any overfitting problem that those dependencies would cause.
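
To illustrate the alternative, here is a minimal pure-Python sketch of
a stratified fold assignment that keeps the acquisition order within
each class. It is only an illustration under my own assumptions, not
necessarily what the PR implements; the function name
`order_preserving_stratified_folds` is hypothetical.

```python
from collections import defaultdict


def order_preserving_stratified_folds(labels, n_folds):
    """Return, for each sample, the index of the fold it is tested in.

    Hypothetical helper for illustration only (not scikit-learn API).
    """
    # Group sample indices by class, in their original (acquisition) order.
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    test_fold = [None] * len(labels)
    for class_indices in indices_by_class.values():
        n = len(class_indices)
        # Cut each class into n_folds contiguous chunks of near-equal size,
        # so neighbouring samples of a class tend to land in the same fold.
        for position, idx in enumerate(class_indices):
            test_fold[idx] = position * n_folds // n
    return test_fold


labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
folds = order_preserving_stratified_folds(labels, n_folds=3)
print(folds)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Each fold still holds the same number of samples per class, but the
test folds are contiguous blocks of the original ordering, so samples
that were acquired together are not spread across train and test.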

As StratifiedKFold is the default CV scheme used by `cross_val_score`
and `GridSearchCV`, I think it should not shuffle the samples by
default, so that those issues can be detected.

A caller who is willing to make the IID assumption can always shuffle
the data explicitly, for instance by using `StratifiedShuffleSplit`.

Back to the digits dataset: for a non-optimal SVC model trained on the
first 1000 digits, the overfitting caused by the dependency can amount
to as much as 10% in accuracy score, as highlighted by a new test
included in this PR.
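
For anyone who wants to look at the effect themselves, here is a rough
sketch of such a comparison. It is not the test from the PR. It
assumes a recent scikit-learn where the CV classes live in
`sklearn.model_selection` and where `StratifiedKFold(shuffle=False)`
keeps the sample order, and the SVC hyperparameters are arbitrary
values I picked for illustration. The size of the gap will depend on
the model and the scikit-learn version.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import (
    StratifiedKFold,
    StratifiedShuffleSplit,
    cross_val_score,
)
from sklearn.svm import SVC

# Keep only the first 1000 digits, in their stored (acquisition) order.
digits = load_digits()
X, y = digits.data[:1000], digits.target[:1000]

# Arbitrary, deliberately untuned hyperparameters (an assumption).
model = SVC(kernel="rbf", C=1.0, gamma=0.001)

# Stratified folds that respect the original ordering of the samples.
ordered_cv = StratifiedKFold(n_splits=5, shuffle=False)

# Explicit shuffling: the caller opts in to the IID assumption.
shuffled_cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2,
                                     random_state=0)

ordered_scores = cross_val_score(model, X, y, cv=ordered_cv)
shuffled_scores = cross_val_score(model, X, y, cv=shuffled_cv)

print("ordered folds:  %.3f" % ordered_scores.mean())
print("shuffled folds: %.3f" % shuffled_scores.mean())
```

If the samples really are dependent, the shuffled estimate should come
out more optimistic than the ordered one.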

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general