Whether this is a problem depends on whether the data-generating process
is really not IID,
or whether the creator of the dataset merely sorted it, which I think is
a very common thing to do.
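
To illustrate the difference, here is a minimal synthetic sketch (all names
and the drift scenario are made up for illustration, not from the original
thread): a decision boundary that drifts over "time". Shuffled CV lets every
validation point borrow temporally adjacent training neighbors, so the score
looks fine, while an honest train-on-past / test-on-future split reveals the
drift.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Hypothetical non-IID data: the true boundary x > 2*t - 1 drifts as
# "time" t grows, and t is included as a feature.
rng = np.random.RandomState(0)
n = 2000
t = np.linspace(0.0, 1.0, n)          # samples arrive in time order
x = rng.uniform(-1.0, 1.0, n)
X = np.column_stack([x, t])
y = (x > 2.0 * t - 1.0).astype(int)   # label rule changes over time

clf = KNeighborsClassifier(n_neighbors=5)

# Shuffled CV: each validation fold has temporal neighbors in the
# training folds, so the drift is hidden and the score is optimistic.
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
shuffled_score = cross_val_score(clf, X, y, cv=cv).mean()

# Honest evaluation for drifting data: fit on the past, score the future.
split = int(0.75 * n)
clf.fit(X[:split], y[:split])
future_score = clf.score(X[split:], y[split:])

print(f"shuffled CV score:            {shuffled_score:.3f}")
print(f"train past / test future:     {future_score:.3f}")
```

With this setup the shuffled CV score comes out clearly higher than the
past/future score, which is exactly the "hidden under the carpet" effect
Olivier describes below.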


On 12/05/2014 09:13 AM, Olivier Grisel wrote:
> 2014-12-02 8:09 GMT+01:00 Lei Gong <leig.in...@gmail.com>:
>> Hi Michael and Andy,
>>
>> Thanks so much for your input. I think the cause of the problem is the IID
>> assumption. I am new and did not know that the default CV splits the training
>> set without shuffling. After adding "cv = ShuffleSplit", everything seems fine.
> Shuffling the samples before running the CV will not fix the IID
> breakage problem, it will just hide it under the carpet by creating
> easy-to-classify validation folds that artificially look like the
> training data. If the data-generating process is not IID, the future
> test data you will receive will stem from a different distribution,
> and the cross-validation score will be artificially much better than
> the real score that you will be able to measure on test data.
>
> Basically, shuffling makes it possible to (mistakenly) hide
> overfitting issues caused by the IID breakage.
>


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
