2014-12-02 8:09 GMT+01:00 Lei Gong <leig.in...@gmail.com>:
> Hi Michael and Andy,
>
> Thanks so much for your input. I think the cause of the problem is the iid
> assumption. I am new and did not know the default CV split the training set
> without shuffling. After add "cv = ShuffleSplit", everything seems fine.

Shuffling the samples before running the CV will not fix the iid breakage. It will just sweep it under the carpet by creating easy-to-classify validation folds that artificially look like the training data.

If the data-generating process is not iid, the future test data you receive will stem from a different distribution, and the cross-validation score will be artificially much better than the real score you would measure on that test data.

Basically, shuffling makes it possible to (mistakenly) hide overfitting issues caused by the iid breakage.

--
Olivier

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
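
To make the point about shuffling concrete, here is a minimal sketch in plain Python. It deliberately does not use the scikit-learn API. Everything in it is an assumption made up for illustration: the synthetic data (20 groups of 10 samples, stored group after group), the rule that the label belongs to the group rather than to anything a model could learn, the hand-written 1-nearest-neighbour classifier, and the helper names `make_grouped_data`, `predict_1nn` and `kfold_accuracy`.

```python
import random


def make_grouped_data(n_groups=20, per_group=10, seed=0):
    """Build 1-D samples stored group after group, so the rows are not iid."""
    rng = random.Random(seed)
    X, y = [], []
    for g in range(n_groups):
        center = 10.0 * g  # the feature mostly identifies the group
        label = g % 2      # the label is a property of the group itself
        for _ in range(per_group):
            X.append(center + rng.gauss(0.0, 0.01))
            y.append(label)
    return X, y


def predict_1nn(X_train, y_train, x):
    """Return the label of the training sample closest to x."""
    nearest = min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
    return y_train[nearest]


def kfold_accuracy(X, y, n_folds=5, shuffle=False, seed=0):
    """Accuracy of 1-NN under K-fold CV, with or without shuffling first."""
    indices = list(range(len(X)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    fold_size = len(indices) // n_folds
    correct = 0
    for k in range(n_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        correct += sum(predict_1nn(X_train, y_train, X[i]) == y[i] for i in test)
    return correct / (n_folds * fold_size)


X, y = make_grouped_data()
# Contiguous folds hold out whole groups, like future data from unseen groups.
contiguous_score = kfold_accuracy(X, y, shuffle=False)
# Shuffled folds leave near-duplicates of every test sample in the training set.
shuffled_score = kfold_accuracy(X, y, shuffle=True)
print("contiguous folds:", contiguous_score)
print("shuffled folds:  ", shuffled_score)
```

In this toy setup the shuffled score should come out near perfect, while the contiguous split stays around chance level. The chance-level number is the honest estimate for data from groups the model has never seen. The model is the same in both runs; only the split changed. With real data, a splitter that keeps each group entirely on one side of the split plays the role of the contiguous folds here.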