I am seeing a discrepancy in classification performance between two
cross-validation techniques applied to the same data. I was wondering if anyone
can shed some light on this.
Thanks in advance for your help.
Mamun
Method 1: cross_validation.train_test_split
Method 2: StratifiedKFold.
Two examples with the same data set:
Data set: 5500 samples (Class 1 = 500; Class 0 = 5000) by 193 features
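One visible difference between the two methods is how test rows are chosen: train_test_split shuffles rows by default, while StratifiedKFold is called here with shuffle=False. A minimal sketch of that difference on toy labels, assuming a recent scikit-learn where both live in sklearn.model_selection (not the older cross_validation module used in this post). The toy X and y are made up for illustration and are not the real data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy labels with the same 10:1 imbalance as the post (hypothetical data):
# 50 negatives followed by 5 positives.
y = np.array([0] * 50 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # single feature = original row index

# Method 1 style: rows are shuffled before splitting (shuffle=True is the default).
_, X_te, _, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
random_test_rows = np.sort(X_te.ravel())

# Method 2 style: no shuffling, so each test fold takes consecutive rows per class.
skf = StratifiedKFold(n_splits=5, shuffle=False)
fold_test_rows = [test for _, test in skf.split(X, y)]
positives_per_fold = [int(y[test].sum()) for test in fold_test_rows]

print("train_test_split test rows:", random_test_rows)
print("StratifiedKFold fold 1 test rows:", fold_test_rows[0])
print("positives per fold:", positives_per_fold)
```

If the rows of the real data are ordered in some way (by time or by source, say), unshuffled folds test on whole blocks of neighbouring rows, which random splits do not. Whether that applies to this data set is not something the post establishes.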
==========================================================================================
Method 1 [ Random Iteration with train_test_split ]
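from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_curve, auc

for i in range(0, 5):
    # New random 80/20 split on each iteration (not stratified)
    X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(
        X_train.values, y_train, test_size=0.2, random_state=i)
    clf = RandomForestClassifier(n_estimators=250, max_depth=None,
                                 min_samples_split=1, random_state=0,
                                 oob_score=True)
    # Fit once and reuse the fitted model for both predictions
    clf.fit(X_tr, y_tr)
    y_score = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)
    cm = confusion_matrix(y_te, y_score)
    print cm
    fpr, tpr, thresholds = roc_curve(y_te, y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print "ROC AUC: ", roc_auc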
Result of method 1
Iteration 1 ROC AUC: 0.91
[[998   4]
 [ 42  56]]
Iteration 5 ROC AUC: 0.88
[[1000    3]
 [  35   62]]
==========================================================================================
Method 2 [ StratifiedKFold cross validation ]
from sklearn.cross_validation import StratifiedKFold

cv = StratifiedKFold(y_train, n_folds=5, shuffle=False, random_state=None)
clf = RandomForestClassifier(n_estimators=250, max_depth=None,
                             min_samples_split=1, random_state=None,
                             oob_score=True)
for train, test in cv:
    # Fit once per fold; with random_state=None a second fit would build a
    # different forest, so labels and probabilities would not match
    clf.fit(X_train.values[train], y_train[train])
    y_score = clf.predict(X_train.values[test])
    y_prob = clf.predict_proba(X_train.values[test])
    cm = confusion_matrix(y_train[test], y_score)
    print cm
    fpr, tpr, thresholds = roc_curve(y_train[test], y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print "ROC AUC: ", roc_auc
Result of method 2
Fold 1 ROC AUC: 0.76
Fold 1 Confusion Matrix
[[995   5]
 [ 92   8]]
Fold 5 ROC AUC: 0.77
Fold 5 Confusion Matrix
[[986  14]
 [ 76  23]]
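
To put the two sets of results on a common footing, the positive-class recall and precision can be worked out directly from the confusion matrices reported above. A small sketch in plain Python; the helper name positive_class_rates is made up for illustration, and the layout is assumed to be the one confusion_matrix uses (rows are true classes, columns are predicted classes).

```python
# Confusion matrices copied from the results in this post.
# Assumed layout: [[TN, FP], [FN, TP]] with class 1 as the positive class.
reported = {
    "Method 1, iteration 1": [[998, 4], [42, 56]],
    "Method 1, iteration 5": [[1000, 3], [35, 62]],
    "Method 2, fold 1": [[995, 5], [92, 8]],
    "Method 2, fold 5": [[986, 14], [76, 23]],
}


def positive_class_rates(cm):
    """Return (recall, precision) for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)        # share of true positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that were right
    return recall, precision


for name, cm in reported.items():
    recall, precision = positive_class_rates(cm)
    print(f"{name}: recall={recall:.2f}, precision={precision:.2f}")
```

On these numbers, class 1 recall is roughly 0.57 and 0.64 for Method 1 against 0.08 and 0.23 for Method 2, so most of the gap is in how many positives are found.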