Whether this is a problem depends on whether the data-generating process is really not IID, or whether the creator of the dataset merely sorted it, which I think is a very common thing to do.
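A small synthetic demo makes the point concrete. This is only a sketch with made-up data: the feature drifts over time (so the samples are not IID), and a 1-nearest-neighbor classifier is used because it exploits the "validation folds look like the training data" effect most directly. Ordered (unshuffled) KFold exposes the drift; ShuffleSplit hides it.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
n = 600
t = np.arange(n)
y = rng.randint(0, 2, size=n)
# Non-IID data: the feature's offset drifts with time, so temporally
# adjacent samples look alike even across the two classes.
X = (y + 0.01 * t + 0.1 * rng.randn(n)).reshape(-1, 1)

clf = KNeighborsClassifier(n_neighbors=1)

# Default KFold keeps the samples in order: each validation fold is a
# contiguous time block whose drift offset differs from the training data.
ordered = cross_val_score(clf, X, y, cv=KFold(n_splits=5)).mean()

# ShuffleSplit mixes all time steps, so every validation sample has a
# near-identical temporal neighbor sitting in the training set.
shuffled = cross_val_score(
    clf, X, y,
    cv=ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)).mean()

print("ordered KFold accuracy: ", round(ordered, 3))
print("ShuffleSplit accuracy:  ", round(shuffled, 3))
```

The shuffled score comes out much higher, but that optimism is an artifact of the split, not of the model: future data drawn after the drift has continued will look like the ordered-KFold case, not the shuffled one.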
On 12/05/2014 09:13 AM, Olivier Grisel wrote:
> 2014-12-02 8:09 GMT+01:00 Lei Gong <leig.in...@gmail.com>:
>> Hi Michael and Andy,
>>
>> Thanks so much for your input. I think the cause of the problem is the iid
>> assumption. I am new and did not know the default CV splits the training set
>> without shuffling. After adding "cv = ShuffleSplit", everything seems fine.
>
> Shuffling the samples before running the CV will not fix the iid
> breakage problem; it will just sweep it under the carpet by creating
> easy-to-classify validation folds that artificially look like the
> training data. If the data-generating process is not iid, the future
> test data you will receive will stem from a different distribution,
> and the cross-validation score will be artificially much better than
> the real test score that you will be able to measure on test data.
>
> Basically, shuffling makes it possible to (mistakenly) hide
> overfitting issues caused by the IID breakage.

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general