Hi Michael and Andy,
Thanks so much for your input. I think the cause of the problem is the IID
assumption. I am new and did not know that the default CV splits the training
set without shuffling. After adding "cv = ShuffleSplit", everything looks fine
(rough sketch below). The reason the R2 on the test set was fine in the first
place is that the test set was generated as a random sample from the original
set.
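
In case it is useful for the archives, the change was roughly the following (a
minimal sketch against the sklearn.cross_validation API; the n_iter, test_size
and random_state values are illustrative, not the exact ones I used):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import ShuffleSplit, cross_val_score

lm = LinearRegression()
# shuffled splits instead of the default contiguous KFold
cv = ShuffleSplit(len(trainY), n_iter=10, test_size=0.1, random_state=0)
train_scores_r2 = cross_val_score(lm, trainX_trans_filtered, trainY, cv=cv,
                                  scoring='r2')
print "CV estimated R2 (shuffled splits): {0}".format(np.mean(train_scores_r2))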
Cheers,
Lei
—
Sent from a not-so-smartphone.
On Mon, Dec 01, 2014 at 11:01 PM, Michael Eickenberg
<michael.eickenb...@gmail.com> wrote:
> Dear Lei,
>
> would it be possible for you to report the same numbers for a target set Y
> which is standardized? If Ytest has variance 1, then the MSE (not RMSE) and
> r2 should sum to about 1, since r2 = 1 - MSE / Var(Ytest). (It is not always
> possible to justify this step on all types of data.)
>
> If you standardize all of Ytrain, you can check Andy's very important comment
> on IID-ness in a roundabout way: if the variances of your contiguous folds
> differ strongly from 1 after global normalization, then your data may be
> drifting somehow.
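>
> Something along these lines, for example (a rough sketch; trainY and the 10
> contiguous folds are from your post, the rest is illustrative):
>
> import numpy as np
> Y = np.asarray(trainY, dtype=float)
> Y_std = (Y - Y.mean()) / Y.std()  # global standardization of Ytrain
> # variance of each contiguous chunk, mimicking the default unshuffled 10-fold CV
> for i, chunk in enumerate(np.array_split(Y_std, 10)):
>     print("fold %d variance: %.3f" % (i, chunk.var()))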
>
> Michael
>
>
> On Tuesday, December 2, 2014, Andy
> <t3k...@gmail.com> wrote:
> > Hi Lei.
> >
> > If the CV score and the test set score are very different, that suggests
> > that the IID assumption is violated.
> > I think by default cross_val_score does not shuffle, which can be an issue
> > if your data is sorted in some way.
> >
> > The fact that rmse doesn't show this might just tell you that the rmse
> > doesn't really capture the correlation here.
> > A negative R2 in cross-validation basically means you are not learning
> > anything.
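> >
> > A quick way to check this (a rough sketch, reusing lm,
> > trainX_trans_filtered and trainY from your script; the seed is arbitrary):
> >
> > from sklearn.cross_validation import KFold, cross_val_score
> > cv_plain = KFold(len(trainY), n_folds=10)  # contiguous folds (the default)
> > cv_shuffled = KFold(len(trainY), n_folds=10, shuffle=True, random_state=0)
> > for name, cv in [("contiguous", cv_plain), ("shuffled", cv_shuffled)]:
> >     scores = cross_val_score(lm, trainX_trans_filtered, trainY, cv=cv,
> >                              scoring='r2')
> >     print("%s folds, mean R2: %.3f" % (name, scores.mean()))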
> >
> > Cheers,
> > Andy
> >
> >
> >
> > On 12/01/2014 05:30 PM, Lei Gong wrote:
> > > Hey all,
> > >
> > >
> > > First of all, I want to thank you for this awesome project.
> > >
> > > I am working on a project where I want to fit a linear regression to make
> > > some predictions. The dataset was split into training/test (70/30). I
> > > then applied 10-fold CV on the training set and made predictions on the
> > > test set. It is not a particularly complex problem, so I would expect the
> > > RMSE and R2 estimated from 10-fold CV and from the test set to be
> > > reasonably close to each other.
> > >
> > > It turns out that the estimated RMSEs are quite close: "CV 0.7435" versus
> > > "test set 0.7429". However, the two R2 scores are: "CV -3.0168" versus
> > > "test set 0.8718". I can live with the negative R2, but I am confused by
> > > this inconsistency. I wonder if anyone can help. Thank you in advance.
> > >
> > > =================Here is my script=================
> > >
> > > import numpy as np
> > > from sklearn.linear_model import LinearRegression
> > > from sklearn.cross_validation import cross_val_score
> > >
> > > lm = LinearRegression()
> > > # 10-fold CV on the training set (default splitter, no shuffling)
> > > train_scores_mse = cross_val_score(lm, trainX_trans_filtered, trainY,
> > >                                    cv=10, scoring='mean_squared_error')
> > > train_scores_rmse = np.sqrt(-1.0 * train_scores_mse)
> > > train_scores_r2 = cross_val_score(lm, trainX_trans_filtered, trainY,
> > >                                   cv=10, scoring='r2')
> > > print "CV estimated RMSE: {0} \nCV estimated R2: {1}".format(
> > >     np.mean(train_scores_rmse), np.mean(train_scores_r2))
> > >
> > > CV estimated RMSE: 0.743556872074
> > > CV estimated R2: -3.01685516116
> > >
> > > # apply to the test set
> > > lm.fit(trainX_trans_filtered, trainY)
> > > testY_pred = lm.predict(testX_trans_filtered)
> > >
> > > from sklearn.metrics import r2_score, mean_squared_error
> > > test_score_r2 = r2_score(testY, testY_pred)
> > > test_score_rmse = np.sqrt(mean_squared_error(testY, testY_pred))
> > > print "Test set RMSE: {0} \nTest set R2: {1}".format(
> > >     test_score_rmse, test_score_r2)
> > >
> > > Test set RMSE: 0.742917835704
> > > Test set R2: 0.871834926473
> > >
> > >
> > >
> > > Cheers,
> > >
> > > Lei
> > >