Hi Lei.
If the CV score and the test set score are very different, that suggests
that the IID assumption is violated.
I think by default cross_val_score does not shuffle, which can be an
issue if your data is sorted in some way.
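To check whether row order matters, one option is to pass an explicit
shuffled splitter instead of cv=10. The sketch below is only an
illustration: the data is synthetic and deliberately sorted by the target
(nothing in this thread says the real data is), and it uses
sklearn.model_selection, which is where KFold and cross_val_score live in
current releases rather than the sklearn.cross_validation module used in
the script further down.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical), sorted by target to mimic ordered rows.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.7, size=500)
order = np.argsort(y)
X, y = X[order], y[order]

lm = LinearRegression()

# An integer cv with a regressor means contiguous, unshuffled K-fold splits.
r2_unshuffled = cross_val_score(lm, X, y, cv=10, scoring="r2")

# An explicit KFold with shuffle=True mixes the rows before splitting.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
r2_shuffled = cross_val_score(lm, X, y, cv=cv, scoring="r2")

print("unshuffled mean R2:", r2_unshuffled.mean())
print("shuffled mean R2:  ", r2_shuffled.mean())
```

If the two means differ a lot on the real training set, the ordering of
the rows is a likely suspect.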
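The fact that RMSE doesn't show this might just tell you that RMSE
doesn't really capture the correlation here.
A negative R2 on the cross-validation basically means you are not
learning anything: on the held-out folds the model does worse than
simply predicting the mean of the targets in that fold.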
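One way to see how the two metrics can disagree: R2 is 1 - SS_res/SS_tot,
so it depends on how spread out the held-out targets are, while RMSE only
looks at the errors. The numbers below are made up for illustration (they
are not from the data in this thread), and the small rmse and r2 helpers
are written out by hand so the arithmetic is visible.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical example: identical prediction errors on two sets of targets.
errors = [0.7, -0.7, 0.7, -0.7]
wide_targets = [0.0, 4.0, 8.0, 12.0]    # spread out, like a random test set
narrow_targets = [5.0, 5.2, 5.4, 5.6]   # nearly constant, like one fold of sorted data

wide_pred = [t + e for t, e in zip(wide_targets, errors)]
narrow_pred = [t + e for t, e in zip(narrow_targets, errors)]

rmse_wide, rmse_narrow = rmse(wide_targets, wide_pred), rmse(narrow_targets, narrow_pred)
r2_wide, r2_narrow = r2(wide_targets, wide_pred), r2(narrow_targets, narrow_pred)

print("RMSE:", rmse_wide, rmse_narrow)
print("R2:  ", r2_wide, r2_narrow)
```

Both sets give an RMSE of about 0.7, but R2 is close to 1 for the
spread-out targets and well below zero for the nearly constant ones.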
Cheers,
Andy
On 12/01/2014 05:30 PM, Lei Gong wrote:
Hey all,
First of all, I want to thank you for this awesome project.
I am working on a project where I want to fit a linear regression to
make some predictions. The dataset was split into training/test
(70/30). I then applied 10-fold CV on the training set and made
predictions on the test set. It is not a particularly complex problem, so
I would expect the estimated RMSE and R2 from 10-fold CV and the test set
to be reasonably close to each other.
It turns out that the estimated RMSEs are quite close: "CV 0.7435"
versus "test set 0.7429". However, I found the two R2 scores are as
follows: "CV -3.0168" versus "test set 0.8718". I can live with the
negative R2, but I am confused by this inconsistency. I wonder if
anyone can help. Thank you in advance.
=================Here is my script=================
import numpy as np
from sklearn.linear_model import LinearRegression
# In newer scikit-learn releases this import is sklearn.model_selection
from sklearn.cross_validation import cross_val_score

lm = LinearRegression()
# This scorer returns negated MSE values, hence the sign flip below
train_scores_mse = cross_val_score(lm, trainX_trans_filtered, trainY,
                                   cv=10, scoring='mean_squared_error')
train_scores_rmse = np.sqrt(-1.0 * train_scores_mse)
train_scores_r2 = cross_val_score(lm, trainX_trans_filtered, trainY,
                                  cv=10, scoring='r2')
print("CV estimated RMSE: {0} \nCV estimated R2: {1}".format(
    np.mean(train_scores_rmse), np.mean(train_scores_r2)))
CV estimated RMSE: 0.743556872074
CV estimated R2: -3.01685516116
# apply to the test set
from sklearn.metrics import r2_score, mean_squared_error

lm.fit(trainX_trans_filtered, trainY)
testY_pred = lm.predict(testX_trans_filtered)
test_score_r2 = r2_score(testY, testY_pred)
test_score_rmse = np.sqrt(mean_squared_error(testY, testY_pred))
print("Test set RMSE: {0} \nTest set R2: {1}".format(test_score_rmse,
                                                     test_score_r2))
Test set RMSE: 0.742917835704
Test set R2: 0.871834926473
Cheers,
Lei
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general