Re: [Scikit-learn-general] Inconsistency between r2 scores

Olivier Grisel Fri, 05 Dec 2014 06:14:56 -0800

2014-12-02 8:09 GMT+01:00 Lei Gong <leig.in...@gmail.com>:
> Hi Michael and Andy,
>
> Thanks so much for your input. I think the cause of the problem is the iid
> assumption. I am new and did not know the default CV splits the training set
> without shuffling. After adding "cv = ShuffleSplit", everything seems fine.


Shuffling the samples before running the CV will not fix the iid
breakage problem; it will just sweep it under the carpet by creating
easy-to-predict validation folds that artificially resemble the
training data. If the data generating process is not iid, the future
test data you receive will stem from a different distribution, and
the cross-validation score will be artificially much better than the
real score you will be able to measure on that test data.

Basically, shuffling makes it possible to (mistakenly) hide
overfitting issues caused by the iid breakage.
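
To make this concrete, here is a minimal sketch (not from the original
thread). It assumes a recent scikit-learn where the splitters live in
sklearn.model_selection. The random-walk target, the time-index feature
and the choice of KNeighborsRegressor are illustrative assumptions,
picked only because neighboring samples are then strongly dependent:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical non-iid data: a random walk observed over time.
# Consecutive samples are strongly correlated with each other.
rng = np.random.RandomState(0)
n_samples = 500
X = np.arange(n_samples, dtype=float).reshape(-1, 1)  # time index
y = np.cumsum(rng.normal(size=n_samples))             # random walk

model = KNeighborsRegressor(n_neighbors=3)

# Contiguous folds: each validation block is a stretch of time the
# model has never seen, so it has to extrapolate from the block edges.
contiguous = cross_val_score(
    model, X, y, cv=KFold(n_splits=5), scoring="r2")

# Shuffled folds: every validation sample has training neighbors one
# or two time steps away, so the validation data look like the
# training data.
shuffled = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2")

print("contiguous folds, mean r2: %.3f" % contiguous.mean())
print("shuffled folds,   mean r2: %.3f" % shuffled.mean())
```

The shuffled score should come out much higher here, but nothing about
the model has improved: it still cannot predict a stretch of time it
has not seen, which is the situation it would face on future data.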

-- 
Olivier

