Hello,

I was comparing scores from CV with a score obtained by training on a
subset of the data used in the CV, and I get very different answers. This
surprised me. Should it? If not, how do I understand how/why this happens?

I run:

    scores = cross_validation.cross_val_score(clf, X_dev, y_dev,
                                              scoring="roc_auc", n_jobs=6)

and get three scores around 0.77.

Then I split X_dev with train_test_split(test_size=0.33) and retrain my
classifier on the training part and evaluate the score on the training. Now
the score is around 0.70.

I thought that the second part, training the classifier on X_train, would
be similar to one of the splits that cross validation comes up with. If the
score between the three CV splits varied a lot more then I would not be
surprised, but the variation is pretty small compared to the difference
between the CV scores and training on 2/3 of X_dev.
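
For anyone who wants to reproduce the comparison without the gist, here is a
minimal sketch of the two procedures side by side. It is not my actual code:
the data is synthetic (make_classification), the classifier is a stand-in
LogisticRegression, and it uses the newer sklearn.model_selection module
rather than cross_validation. One thing it makes explicit is that, for a
classifier, cross_val_score uses stratified folds without shuffling, whereas
train_test_split shuffles and does not stratify unless asked to. The score on
the training part and on the held-out part are both printed so they cannot be
mixed up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

# Synthetic stand-in for X_dev, y_dev (assumption: not the data from the gist).
X_dev, y_dev = make_classification(n_samples=600, n_features=20,
                                   random_state=0)
# Stand-in classifier (assumption: the gist may use a different model).
clf = LogisticRegression(max_iter=1000)

# Procedure 1: three-fold CV. The stratified, unshuffled folds are spelled
# out here; this is what cross_val_score does for a classifier with cv=3.
cv = StratifiedKFold(n_splits=3)
cv_scores = cross_val_score(clf, X_dev, y_dev, scoring="roc_auc", cv=cv)

# Procedure 2: one shuffled, non-stratified split, 2/3 train and 1/3 test.
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)

# ROC AUC needs scores, not hard labels, so use the positive-class probability.
train_auc = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print("CV scores:     ", cv_scores)
print("train-part AUC:", train_auc)
print("test-part AUC: ", test_auc)
```

Repeating procedure 2 with several values of random_state gives a rough idea
of how much a single split can move around compared to the spread of the
three CV scores.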

The (full) code is here:
https://gist.github.com/betatim/822785858d15a92aeafb

Surely-overlooking-something,
T
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general