Dear Lei,
would it be possible for you to report the same numbers for a target set Y
which is standardized? If Ytest has variance 1, then the MSE (not RMSE) and
r2 should sum to about 1. (It is not always possible to justify this step
on all types of data.)
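To make the "sum to about 1" statement concrete, here is a small sketch on
synthetic data. The arrays and the helper names mse and r2 are made up for
illustration and are not taken from your script. Since
r2 = 1 - MSE / Var(Ytest), a test target with variance exactly 1 gives
MSE + r2 = 1. If you standardize with training-set statistics instead, the
test variance is only approximately 1, and so is the sum.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic stand-ins for testY and testY_pred (assumption, not real data).
rng = np.random.default_rng(0)
y_test = rng.normal(loc=5.0, scale=3.0, size=200)
y_pred = y_test + rng.normal(scale=1.0, size=200)

# Standardize both with the test-set mean and standard deviation (ddof=0),
# so that the standardized target has variance exactly 1.
mu, sigma = y_test.mean(), y_test.std()
y_test_std = (y_test - mu) / sigma
y_pred_std = (y_pred - mu) / sigma

total = mse(y_test_std, y_pred_std) + r2(y_test_std, y_pred_std)
print("Var(y_test_std):", np.var(y_test_std))
print("MSE + R2:", total)
```

If the sum is far from 1 on your standardized data, the variance of the
set you are scoring on is not what you think it is, which is worth knowing.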
If you standardize all of Ytrain, you can check Andy's very important
comment on iid-ness in a roundabout way: if the variances of your contiguous
folds differ strongly from 1 after global normalization, then your data may
be drifting somehow.
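A rough sketch of that check, again on synthetic data. The function name
contiguous_fold_variances, the sorted target used as an extreme case of
ordered data, and the value hypothetical_mse are all assumptions made for
illustration. The folds are contiguous chunks, which is what an unshuffled
K-fold split produces.

```python
import numpy as np


def contiguous_fold_variances(y, n_folds=10):
    """Variance of y inside each contiguous fold, after standardizing y globally."""
    y = np.asarray(y, dtype=float)
    y_std = (y - y.mean()) / y.std()
    return np.array([fold.var() for fold in np.array_split(y_std, n_folds)])


rng = np.random.default_rng(0)
y_iid = rng.normal(size=1000)   # order carries no information
y_sorted = np.sort(y_iid)       # same values, but sorted: an extreme "drift"

var_iid = contiguous_fold_variances(y_iid)
var_sorted = contiguous_fold_variances(y_sorted)
print("fold variances, iid order:   ", np.round(var_iid, 2))
print("fold variances, sorted order:", np.round(var_sorted, 2))

# With the same (hypothetical) MSE in every fold, the per-fold R2 is
# 1 - MSE / Var(fold), so low-variance folds push R2 far below zero
# even though the RMSE looks unremarkable.
hypothetical_mse = 0.5
r2_sorted = 1.0 - hypothetical_mse / var_sorted
print("per-fold R2, sorted order:   ", np.round(r2_sorted, 2))
```

In the iid case the fold variances stay near 1; in the sorted case they
collapse, and the per-fold r2 goes negative while the MSE is unchanged. If
your folds look like the second case, shuffling before splitting is the
usual remedy; in current scikit-learn releases that would be something like
cv=KFold(n_splits=10, shuffle=True, random_state=0) from
sklearn.model_selection, passed to cross_val_score.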
Michael
On Tuesday, December 2, 2014, Andy <t3k...@gmail.com> wrote:
> Hi Lei.
>
> If the CV score and the test set score are very different, that suggests
> that the IID assumption is violated.
> I think by default cross_val_score does not shuffle, which can be an issue
> if your data is sorted in some way.
>
> The fact that rmse doesn't show this might just tell you that the rmse
> doesn't really capture the correlation here.
> A negative R2 on the cross-validation means you are not learning anything
> basically.
>
> Cheers,
> Andy
>
>
>
> On 12/01/2014 05:30 PM, Lei Gong wrote:
>
> Hey all,
>
>
> First of all, I want to thank you for this awesome project.
>
> I am working on a project where I want to fit a linear regression to
> make some predictions. The dataset was split into training/test (70/30). I
> then applied 10-fold CV on the training set and made predictions on the
> test set. It is not a particularly complex problem so I would expect the
> estimated RMSE and R2 from 10-fold CV and test set to be reasonably close
> to each other.
>
> It turns out that the estimated RMSE are quite close: "CV 0.7435" versus
> “test set 0.7429”. However, I found the two R2 scores are as follows: “CV
> -3.0168” versus “test set 0.8718”. I can live with the negative R2, but I
> am confused by this inconsistency. I wonder if anyone can help. Thank you
> in advance.
>
> =================Here is my script=================
>
> import numpy as np
> from sklearn.linear_model import LinearRegression
> from sklearn.cross_validation import cross_val_score
>
> lm = LinearRegression()
> train_scores_mse = cross_val_score(lm, trainX_trans_filtered, trainY,
> cv=10,
> scoring = 'mean_squared_error')
> train_scores_rmse = np.sqrt(-1.0 * train_scores_mse)
> train_scores_r2 = cross_val_score(lm, trainX_trans_filtered, trainY,
> cv=10,
> scoring = 'r2')
> print "CV estimated RMSE: {0} \nCV estimated R2: {1}".format(
>     np.mean(train_scores_rmse), np.mean(train_scores_r2))
>
> CV estimated RMSE: 0.743556872074
> CV estimated R2: -3.01685516116
>
> # apply to the test set
> lm.fit(trainX_trans_filtered, trainY)
> testY_pred = lm.predict(testX_trans_filtered)
>
> from sklearn.metrics import r2_score, mean_squared_error
>
> test_score_r2 = r2_score(testY, testY_pred)
> test_score_rmse = np.sqrt(mean_squared_error(testY, testY_pred))
> print "Test set RMSE: {0} \nTest set R2: {1}".format(test_score_rmse,
> test_score_r2)
>
> Test set RMSE: 0.742917835704
> Test set R2: 0.871834926473
>
> Cheers,
>
> Lei
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general