Re: [Scikit-learn-general] CV scores vs scores on a manual split

Andy Thu, 19 Feb 2015 14:00:13 -0800

You are passing roc_auc_score the result of "predict". You should pass it the result of "predict_proba".

This has come up quite a bit already; I'm not sure how we can keep people from making this mistake.
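
To illustrate the point, here is a minimal sketch on synthetic data. It is not Tim's script: the dataset, the LogisticRegression model and all variable names are assumptions made for illustration, and it uses the current sklearn.model_selection module rather than the sklearn.cross_validation module in use at the time of this thread. It compares the AUC computed from hard labels with the AUC computed from positive-class probabilities, next to what cross_val_score reports with scoring="roc_auc".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real data (assumption, not the HIGGS dataset).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The mistake: hard 0/1 labels collapse the ROC curve to a single threshold.
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))

# The fix: pass a continuous score, here the probability of class 1.
auc_from_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# The "roc_auc" scorer uses continuous scores internally, so its results
# are comparable to auc_from_scores, not to auc_from_labels.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=3, scoring="roc_auc"
)

print("AUC from predict:      ", auc_from_labels)
print("AUC from predict_proba:", auc_from_scores)
print("cross_val_score AUCs:  ", cv_scores)
```

If the manual split is scored with labels while cross-validation is scored with "roc_auc", the two numbers are not measuring the same thing, which would be one way to end up with a gap like the one described in this thread.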




On 02/19/2015 04:56 AM, Tim Head wrote:

Hi Gilles,
On Thu Feb 19 2015 at 8:35:35 AM Gilles Louppe <[email protected]> wrote:
    Hi Tim,

    By default, cross_val_score uses StratifiedKFold(shuffle=False) to
    create the train/test folds while train_test_split uses ShuffleSplit.
    The discrepancy you observe might therefore come from either
    shuffling, the stratification of the labels or both of them.

    Can you set the CV parameter in cross_val_score to
    - KFold(n_folds=3, shuffle=True)
    - KFold(n_folds=3, shuffle=False)
    - StratifiedKFold(n_folds=3, shuffle=True)
    - StratifiedKFold(n_folds=3, shuffle=False)
    and then try to determine in which cases scores are consistent?
The two classes are pretty balanced ("mean" label value = 0.529 with labels 0 and 1), so naively the stratification should not change anything.
Below is what I get for the four options I tried:

cv=3
[ 0.77333168 0.77171963 0.77402341]
------------------------------------------
cv=ShuffleSplit(670000, n_iter=3, test_size=0.33, random_state=None)
[ 0.7745969 0.77283909 0.77140412]
------------------------------------------
cv=sklearn.cross_validation.KFold(n=670000, n_folds=3, shuffle=False, random_state=None)
[ 0.77326581 0.77155045 0.77374548]
------------------------------------------
cv=sklearn.cross_validation.KFold(n=670000, n_folds=3, shuffle=True, random_state=None)
[ 0.77298131 0.77332662 0.77225896]
------------------------------------------
Conclusion: they all give the same answer, which is what I'd expect given that the dataset is balanced and already in random order :-/ and still, splitting X_dev "by hand" with train_test_split() gives me a different answer.
For the moment I think there must be an (obvious) bug in my script that I need to find.
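
For anyone repeating the splitter comparison above on a current scikit-learn, a rough sketch follows. It runs on synthetic data, not the HIGGS sample. The model, the sample size and scoring="roc_auc" are assumptions for illustration. The splitters come from sklearn.model_selection, where n_folds and n_iter are now called n_splits and the splitters no longer take the number of samples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold,
    ShuffleSplit,
    StratifiedKFold,
    cross_val_score,
)

# Synthetic, roughly balanced binary problem (assumption for illustration).
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# An integer cv means StratifiedKFold without shuffling for classifiers.
splitters = {
    "cv=3": 3,
    "ShuffleSplit": ShuffleSplit(n_splits=3, test_size=0.33, random_state=0),
    "KFold(shuffle=False)": KFold(n_splits=3, shuffle=False),
    "KFold(shuffle=True)": KFold(n_splits=3, shuffle=True, random_state=0),
    "StratifiedKFold(shuffle=True)": StratifiedKFold(
        n_splits=3, shuffle=True, random_state=0
    ),
}

# Same estimator and same metric for every splitter, so only the
# fold construction differs between rows.
results = {}
for name, cv in splitters.items():
    results[name] = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    print(name, results[name])
```

On balanced data that is already in random order, the rows should agree up to fold-to-fold noise. If they do, the splitter is unlikely to explain a gap between cross-validated and manually computed scores.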
T
P.S. I posted a minimal script here: https://gist.github.com/betatim/a31777c36e3b4b6f21bb
It uses the first million samples from this dataset, which is quite large: http://archive.ics.uci.edu/ml/datasets/HIGGS

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
