Hi Lei.

If the CV score and the test set score are very different, that suggests that the IID assumption is violated. I think by default cross_val_score does not shuffle, which can be an issue if your data is sorted in some way.
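
To illustrate the shuffling point, here is a minimal sketch rather than a diagnosis of your data. It assumes a recent scikit-learn, where KFold and cross_val_score live in sklearn.model_selection, and it uses made-up synthetic data that is deliberately sorted by the target. It compares contiguous folds with shuffled folds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Sort the rows by target value to mimic a dataset stored in sorted order
order = np.argsort(y)
X, y = X[order], y[order]

lm = LinearRegression()

# Contiguous folds: each test fold only covers a narrow slice of y
sorted_r2 = cross_val_score(lm, X, y, cv=KFold(n_splits=10), scoring="r2")

# Shuffled folds: each test fold is a random sample of the rows
shuffled_cv = KFold(n_splits=10, shuffle=True, random_state=0)
shuffled_r2 = cross_val_score(lm, X, y, cv=shuffled_cv, scoring="r2")

print("mean R2, contiguous folds:", sorted_r2.mean())
print("mean R2, shuffled folds:  ", shuffled_r2.mean())
```

If your training data has some ordering, passing a shuffled KFold as cv is a cheap way to check whether that ordering is what drives the gap.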

The fact that the RMSE doesn't show this might just tell you that the RMSE doesn't really capture the correlation here. A negative R2 on the cross-validation basically means you are not learning anything: on those folds the model does worse than always predicting the mean of the fold's targets.
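
As a small worked example of how the RMSE can agree while R2 does not (the numbers are invented, not taken from your data): R2 compares the squared error with the spread of the targets in the set being scored, so the same absolute errors give very different R2 values on a wide and on a narrow range of targets.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Identical prediction errors applied to two hypothetical target sets
errors = np.array([0.5, -0.5, 0.5, -0.5, 0.5])

y_wide = np.array([0.0, 2.0, 4.0, 6.0, 8.0])    # targets spread over a wide range
y_narrow = np.array([3.8, 3.9, 4.0, 4.1, 4.2])  # targets packed into a narrow range

pred_wide = y_wide + errors
pred_narrow = y_narrow + errors

# RMSE only depends on the errors, so it is the same in both cases
rmse_wide = np.sqrt(mean_squared_error(y_wide, pred_wide))
rmse_narrow = np.sqrt(mean_squared_error(y_narrow, pred_narrow))

# R2 divides the squared error by the variance of the targets being scored,
# so it collapses when the targets barely vary
r2_wide = r2_score(y_wide, pred_wide)
r2_narrow = r2_score(y_narrow, pred_narrow)

print("RMSE wide / narrow:", rmse_wide, rmse_narrow)
print("R2   wide / narrow:", r2_wide, r2_narrow)
```

A CV fold that only covers a narrow slice of the target behaves like the second case, which is one way sorted data can produce a strongly negative mean R2 alongside a reasonable RMSE.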

Cheers,
Andy



On 12/01/2014 05:30 PM, Lei Gong wrote:
Hey all,


First of all, I want to thank you for this awesome project.

I am working on a project where I want to fit a linear regression to make some predictions. The dataset was split into training/test (70/30). I then applied 10-fold CV on the training set and made predictions on the test set. It is not a particularly complex problem, so I would expect the estimated RMSE and R2 from 10-fold CV and from the test set to be reasonably close to each other.

It turns out that the estimated RMSEs are quite close: "CV 0.7435" versus "test set 0.7429". However, the two R2 scores are as follows: "CV -3.0168" versus "test set 0.8718". I can live with the negative R2, but I am confused by this inconsistency. I wonder if anyone can help. Thank you in advance.

=================Here is my script=================

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score
from sklearn.metrics import r2_score, mean_squared_error

lm = LinearRegression()

# 10-fold CV on the training set (the MSE scorer returns negated values)
train_scores_mse = cross_val_score(lm, trainX_trans_filtered, trainY, cv=10,
                                   scoring='mean_squared_error')
train_scores_rmse = np.sqrt(-1.0 * train_scores_mse)
train_scores_r2 = cross_val_score(lm, trainX_trans_filtered, trainY, cv=10,
                                  scoring='r2')
print("CV estimated RMSE: {0} \nCV estimated R2: {1}".format(
    np.mean(train_scores_rmse), np.mean(train_scores_r2)))
# CV estimated RMSE: 0.743556872074
# CV estimated R2: -3.01685516116

# apply to the test set
lm.fit(trainX_trans_filtered, trainY)
testY_pred = lm.predict(testX_trans_filtered)
test_score_r2 = r2_score(testY, testY_pred)
test_score_rmse = np.sqrt(mean_squared_error(testY, testY_pred))
print("Test set RMSE: {0} \nTest set R2: {1}".format(test_score_rmse,
                                                     test_score_r2))
# Test set RMSE: 0.742917835704
# Test set R2: 0.871834926473

Cheers,
Lei



_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
