train_test_split is not stratified.
In master, you can use "stratify=y" to make it stratified.
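Also: randomness. train_test_split shuffles the samples before splitting,
whereas your StratifiedKFold call uses shuffle=False, so the two methods
are not evaluated on comparable splits.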
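To illustrate what "stratified" means here, below is a minimal pure-Python sketch, not scikit-learn's implementation. The helper names plain_split and stratified_split are made up for this example, and the labels only mimic the class counts from the message (5000 negatives, 500 positives). In scikit-learn itself the equivalent would be passing stratify=y to train_test_split, as mentioned above.

```python
import random
from collections import Counter, defaultdict


def plain_split(labels, test_size, seed):
    """Shuffle all indices and take the first test_size fraction as test set."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(round(len(labels) * test_size))
    return idx[n_test:], idx[:n_test]


def stratified_split(labels, test_size, seed):
    """Split each class separately so the test set keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = int(round(len(members) * test_size))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical labels with the same class balance as in the thread.
y = [0] * 5000 + [1] * 500

for seed in range(3):
    _, plain_test = plain_split(y, 0.2, seed)
    _, strat_test = stratified_split(y, 0.2, seed)
    print("seed", seed,
          "plain:", dict(Counter(y[i] for i in plain_test)),
          "stratified:", dict(Counter(y[i] for i in strat_test)))
```

With these counts the stratified test set always holds exactly 1000 negatives and 100 positives, while the plain split drifts around those numbers from seed to seed. This matches the row sums of the confusion matrices quoted below.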
On 09/15/2015 10:55 AM, Mamun Rashid wrote:
I am seeing a discrepancy in classification performance between two
cross-validation techniques applied to the same data. I was wondering if
anyone can shed some light on this.
Thanks in advance for your help.
Mamun
* Method 1: cross_validation.train_test_split
* Method 2: StratifiedKFold.
Two examples with the same data set
Data set: 5500 samples (Class 1 = 500; Class 0 = 5000) by 193 features
==========================================================================================
Method 1 [ Random Iteration with train_test_split ]
    for i in range(0,5):
        X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(
            X_train.values, y_train, test_size=0.2, random_state=i)
        clf = RandomForestClassifier(n_estimators=250, max_depth=None,
                                     min_samples_split=1, random_state=0,
                                     oob_score=True)
        y_score = clf.fit(X_tr, y_tr).predict(X_te)
        y_prob = clf.fit(X_tr, y_tr).predict_proba(X_te)
        cm = confusion_matrix(y_te, y_score)
        print cm
        fpr, tpr, thresholds = roc_curve(y_te, y_prob[:,1])
        roc_auc = auc(fpr, tpr); print "ROC AUC: ", roc_auc
Result of method 1
    Iteration 1 ROC AUC: 0.91
    [[998   4]
     [ 42  56]]
    Iteration 5 ROC AUC: 0.88
    [[1000    3]
     [  35   62]]
==========================================================================================
Method 2 [ StratifiedKFold cross validation ]
    cv = StratifiedKFold(y_train, n_folds=5, random_state=None, shuffle=False)
    clf = RandomForestClassifier(n_estimators=250, max_depth=None,
                                 min_samples_split=1, random_state=None,
                                 oob_score=True)
    for train, test in cv:
        #for train, test in kf:
        y_score = clf.fit(X_train.values[train],
                          y_train[train]).predict(X_train.values[test])
        y_prob = clf.fit(X_train.values[train],
                         y_train[train]).predict_proba(X_train.values[test])
        cm = confusion_matrix(y_train[test], y_score)
        print cm
        fpr, tpr, thresholds = roc_curve(y_train[test], y_prob[:,1])
        roc_auc = auc(fpr, tpr); print "ROC AUC: ", roc_auc
Result of method 2
    Fold 1 ROC AUC: 0.76
    Fold 1 Confusion Matrix
    [[995   5]
     [ 92   8]]
    Fold 5 ROC AUC: 0.77
    Fold 5 Confusion Matrix
    [[986  14]
     [ 76  23]]
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general