Hi Michael and Andy,
Thanks so much for your input. I think the cause of the problem is the IID
assumption. I am new and did not know that the default CV splits the training
set without shuffling. After adding "cv = ShuffleSplit", everything looks fine
(rough sketch below). The reason the R2 on the test set was fine in the first
place is that the test set was generated as a random sample from the original
set.
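
In case it is useful for the archives, the change was roughly the following (a
minimal sketch against the sklearn.cross_validation API; the n_iter, test_size
and random_state values are illustrative, not the exact ones I used):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import ShuffleSplit, cross_val_score

lm = LinearRegression()
# shuffled splits instead of the default contiguous KFold
cv = ShuffleSplit(len(trainY), n_iter=10, test_size=0.1, random_state=0)
train_scores_r2 = cross_val_score(lm, trainX_trans_filtered, trainY, cv=cv,
                                  scoring='r2')
print "CV estimated R2 (shuffled splits): {0}".format(np.mean(train_scores_r2))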
Cheers,
Lei
—
Sent from a not-so-smartphone.
On Mon, Dec 01, 2014 at 11:01 PM, Michael Eickenberg
<michael.eickenb...@gmail.com> wrote:
> Dear Lei,
>
> would it be possible for you to report the same numbers for a target set Y
> which is standardized? If Ytest has variance 1, then the MSE (not RMSE) and
> r2 should sum to about 1, since r2 = 1 - MSE / Var(Ytest). (It is not always
> possible to justify this step on all types of data.)
>
> If you standardize all of Ytrain, you can check Andy's very important comment
> on IID-ness in a roundabout way: if the variances of your contiguous folds
> differ strongly from 1 after global normalization, then your data may be
> drifting somehow.
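>
> Something along these lines, for example (a rough sketch; trainY and the 10
> contiguous folds are from your post, the rest is illustrative):
>
> import numpy as np
> Y = np.asarray(trainY, dtype=float)
> Y_std = (Y - Y.mean()) / Y.std()  # global standardization of Ytrain
> # variance of each contiguous chunk, mimicking the default unshuffled 10-fold CV
> for i, chunk in enumerate(np.array_split(Y_std, 10)):
>     print("fold %d variance: %.3f" % (i, chunk.var()))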
>
> Michael
>
>
> On Tuesday, December 2, 2014, Andy
> <t3k...@gmail.com> wrote:
> > Hi Lei.
> >
> > If the CV score and the test set score are very different, that suggests
> > that the IID assumption is violated.
> > I think by default cross_val_score does not shuffle, which can be an issue
> > if your data is sorted in some way.
> >
> > The fact that rmse doesn't show this might just tell you that the rmse
> > doesn't really capture the correlation here.
> > A negative R2 in cross-validation basically means you are not learning
> > anything.
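> >
> > A quick way to check this (a rough sketch, reusing lm,
> > trainX_trans_filtered and trainY from your script; the seed is arbitrary):
> >
> > from sklearn.cross_validation import KFold, cross_val_score
> > cv_plain = KFold(len(trainY), n_folds=10)  # contiguous folds (the default)
> > cv_shuffled = KFold(len(trainY), n_folds=10, shuffle=True, random_state=0)
> > for name, cv in [("contiguous", cv_plain), ("shuffled", cv_shuffled)]:
> >     scores = cross_val_score(lm, trainX_trans_filtered, trainY, cv=cv,
> >                              scoring='r2')
> >     print("%s folds, mean R2: %.3f" % (name, scores.mean()))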
> >
> > Cheers,
> > Andy
> >
> >
> >
> > On 12/01/2014 05:30 PM, Lei Gong wrote:
> > > Hey all,
> > >
> > >
> > > First of all, I want to thank you for this awesome project.
> > >
> > > I am working on a project where I want to fit a linear regression to make
> > > some predictions. The dataset was split into training/test (70/30). I
> > > then applied 10-fold CV on the training set and made predictions on the
> > > test set. It is not a particularly complex problem, so I would expect the
> > > RMSE and R2 estimated from 10-fold CV and from the test set to be
> > > reasonably close to each other.
> > >
> > > It turns out that the estimated RMSEs are quite close: "CV 0.7435" versus
> > > "test set 0.7429". However, the two R2 scores are: "CV -3.0168" versus
> > > "test set 0.8718". I can live with the negative R2, but I am confused by
> > > this inconsistency. I wonder if anyone can help. Thank you in advance.
> > >
> > > =================Here is my script=================
> > >
> > > import numpy as np
> > > from sklearn.linear_model import LinearRegression
> > > from sklearn.cross_validation import cross_val_score
> > >
> > > lm = LinearRegression()
> > > # 10-fold CV on the training set (default splitter, no shuffling)
> > > train_scores_mse = cross_val_score(lm, trainX_trans_filtered, trainY,
> > >                                    cv=10, scoring='mean_squared_error')
> > > train_scores_rmse = np.sqrt(-1.0 * train_scores_mse)
> > > train_scores_r2 = cross_val_score(lm, trainX_trans_filtered, trainY,
> > >                                   cv=10, scoring='r2')
> > > print "CV estimated RMSE: {0} \nCV estimated R2: {1}".format(
> > >     np.mean(train_scores_rmse), np.mean(train_scores_r2))
> > >
> > > CV estimated RMSE: 0.743556872074
> > > CV estimated R2: -3.01685516116
> > >
> > > # apply to the test set
> > > lm.fit(trainX_trans_filtered, trainY)
> > > testY_pred = lm.predict(testX_trans_filtered)
> > >
> > > from sklearn.metrics import r2_score, mean_squared_error
> > > test_score_r2 = r2_score(testY, testY_pred)
> > > test_score_rmse = np.sqrt(mean_squared_error(testY, testY_pred))
> > > print "Test set RMSE: {0} \nTest set R2: {1}".format(
> > >     test_score_rmse, test_score_r2)
> > >
> > > Test set RMSE: 0.742917835704
> > > Test set R2: 0.871834926473
> > >
> > >
> > >
> > > Cheers,
> > >
> > > Lei
> > >