Re: [Scikit-learn-general] Inconsistency between r2 scores

Andy Mon, 01 Dec 2014 15:14:31 -0800

 Hi Lei.

If the CV score and the test set score are very different, that suggests that the IID assumption is violated. I think by default cross_val_score does not shuffle, which can be an issue if your data is sorted in some way.

The fact that the RMSE doesn't show this might just tell you that the RMSE doesn't really capture the correlation here. A negative R2 on the cross-validation basically means you are not learning anything.
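A minimal sketch of the shuffling point, on synthetic data rather than Lei's: when the rows are sorted by the target, each unshuffled fold covers only a narrow band of y values, so the per-fold R2 can turn strongly negative while the RMSE stays in a similar range. It assumes a newer scikit-learn release where the CV helpers live in sklearn.model_selection (the script in this thread uses the older sklearn.cross_validation module) and where the MSE scorer is named 'neg_mean_squared_error'; the data and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic linear data (illustrative only, not the data from this thread).
rng = np.random.RandomState(0)
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.75, size=n)

# Simulate a dataset that happens to be stored sorted by the target.
order = np.argsort(y)
X_sorted = x[order].reshape(-1, 1)
y_sorted = y[order]

lm = LinearRegression()
cv_plain = KFold(n_splits=10)  # contiguous folds, no shuffling
cv_shuffled = KFold(n_splits=10, shuffle=True, random_state=0)

def cv_summary(cv):
    """Return (mean per-fold RMSE, mean per-fold R2) for the given splitter."""
    neg_mse = cross_val_score(lm, X_sorted, y_sorted, cv=cv,
                              scoring="neg_mean_squared_error")
    r2 = cross_val_score(lm, X_sorted, y_sorted, cv=cv, scoring="r2")
    return np.sqrt(-neg_mse).mean(), r2.mean()

rmse_plain, r2_plain = cv_summary(cv_plain)
rmse_shuffled, r2_shuffled = cv_summary(cv_shuffled)

print("unshuffled: RMSE={0:.3f}  R2={1:.3f}".format(rmse_plain, r2_plain))
print("shuffled:   RMSE={0:.3f}  R2={1:.3f}".format(rmse_shuffled, r2_shuffled))
```

If shuffling brings the CV R2 back in line with the test set R2, the ordering of the rows is the likely cause. R2 is computed per fold against that fold's own variance in y, which is why it reacts to sorted data much more than the RMSE does.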


Cheers,
Andy



On 12/01/2014 05:30 PM, Lei Gong wrote:

Hey all,


First of all, I want to thank you for this awesome project.
I am working on a project where I want to fit a linear regression to make some predictions. The dataset was split into training/test (70/30). I then applied 10-fold CV on the training set and made predictions on the test set. It is not a particularly complex problem, so I would expect the estimated RMSE and R2 from 10-fold CV and the test set to be reasonably close to each other.
It turns out that the estimated RMSE values are quite close: "CV 0.7435" versus "test set 0.7429". However, the two R2 scores are as follows: "CV -3.0168" versus "test set 0.8718". I can live with the negative R2, but I am confused by this inconsistency. I wonder if anyone can help. Thank you in advance.
=================Here is my script=================

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score

lm = LinearRegression()
train_scores_mse = cross_val_score(lm, trainX_trans_filtered, trainY, cv=10,
                                   scoring='mean_squared_error')
train_scores_rmse = np.sqrt(-1.0 * train_scores_mse)
train_scores_r2 = cross_val_score(lm, trainX_trans_filtered, trainY, cv=10,
                                  scoring='r2')
print "CV estimated RMSE: {0} \nCV estimated R2: {1}".format(
    np.mean(train_scores_rmse), np.mean(train_scores_r2))

CV estimated RMSE: 0.743556872074
CV estimated R2: -3.01685516116

# apply to the test set
lm.fit(trainX_trans_filtered, trainY)
testY_pred = lm.predict(testX_trans_filtered)

from sklearn.metrics import r2_score, mean_squared_error
test_score_r2 = r2_score(testY, testY_pred)
test_score_rmse = np.sqrt(mean_squared_error(testY, testY_pred))
print "Test set RMSE: {0} \nTest set R2: {1}".format(test_score_rmse,
                                                      test_score_r2)

Test set RMSE: 0.742917835704
Test set R2: 0.871834926473

Cheers,
Lei





_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
