I am seeing a discrepancy in classification performance between two
cross-validation techniques applied to the same data. I was wondering if anyone
can shed some light on this.

Thanks in advance for your help.

Mamun


Method 1: cross_validation.train_test_split
Method 2: StratifiedKFold
Two examples on the same data set follow.

Data set: 5500 samples (class 1 = 500, class 0 = 5000) by 193 features

==========================================================================================
Method 1 [ Random Iteration with train_test_split ]

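from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Five random 80/20 splits; train_test_split shuffles the rows before splitting.
for i in range(0, 5):
    X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(
        X_train.values, y_train, test_size=0.2, random_state=i)
    clf = RandomForestClassifier(n_estimators=250, max_depth=None,
                                 min_samples_split=1, random_state=0,
                                 oob_score=True)
    clf.fit(X_tr, y_tr)  # fit once, then reuse the fitted model
    y_score = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)
    cm = confusion_matrix(y_te, y_score)
    print(cm)
    fpr, tpr, thresholds = roc_curve(y_te, y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %.2f" % roc_auc)

Result of method 1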

Iteration 1 ROC AUC:  0.91
[[998   4]
 [ 42  56]]

Iteration 5 ROC AUC:  0.88
[[1000    3]
 [  35   62]]
==========================================================================================

Method 2 [ StratifiedKFold cross validation ]

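from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Five stratified folds taken in the original row order (shuffle=False).
cv = StratifiedKFold(y_train, n_folds=5, shuffle=False, random_state=None)
clf = RandomForestClassifier(n_estimators=250, max_depth=None,
                             min_samples_split=1, random_state=None,
                             oob_score=True)
for train, test in cv:
    # Fit once per fold; with random_state=None a second fit would build a
    # different forest, so labels and probabilities would come from two models.
    clf.fit(X_train.values[train], y_train[train])
    y_score = clf.predict(X_train.values[test])
    y_prob = clf.predict_proba(X_train.values[test])
    cm = confusion_matrix(y_train[test], y_score)
    print(cm)
    fpr, tpr, thresholds = roc_curve(y_train[test], y_prob[:, 1])
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %.2f" % roc_auc)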
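
One visible difference between the two snippets is that train_test_split
shuffles the rows before splitting, while StratifiedKFold is called here with
shuffle=False. The following is a minimal, standard-library-only sketch of what
that could mean for which rows land in a test set. It is not scikit-learn's
implementation: shuffled_split and ordered_stratified_folds are hypothetical
helpers, the contiguous per-class chunking is a simplifying assumption, and the
labels are made up to match only the class sizes in this post (the real row
order is not known).

```python
import random


def shuffled_split(labels, test_fraction, seed):
    """Random hold-out: shuffle all row indices, then cut off the test part."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(labels) * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def ordered_stratified_folds(labels, n_folds):
    """Unshuffled stratified folds: each class is cut into contiguous chunks."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        for k in range(n_folds):
            start = k * len(members) // n_folds
            stop = (k + 1) * len(members) // n_folds
            folds[k].extend(members[start:stop])
    return [sorted(fold) for fold in folds]


# Hypothetical labels with the class sizes from the post, stored class by class.
labels = [0] * 5000 + [1] * 500

_, test_random = shuffled_split(labels, test_fraction=0.2, seed=0)
test_fold_1 = ordered_stratified_folds(labels, n_folds=5)[0]

# The first unshuffled fold only contains the earliest rows of each class.
print(test_fold_1[:3], test_fold_1[-3:])
# The random split can draw its test rows from anywhere in the file.
print(min(test_random), max(test_random))
```

If the rows happen to be ordered (by time, subject, or batch, for example), the
unshuffled folds test on blocks the model never saw neighbours of, whereas a
shuffled split mixes every block into both sides. Whether that applies to this
data set is an assumption that would need checking.
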
Result of method 2

Fold 1 ROC AUC:  0.76
Fold 1 Confusion Matrix
[[995   5]
 [ 92   8]]

Fold 5 ROC AUC:  0.77
Fold 5 Confusion Matrix
[[986  14]
 [ 76  23]]
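
To make the gap easier to compare than by reading the matrices, here is a small
sketch that computes class-1 recall and precision from the confusion matrices
reported above. It assumes the usual confusion_matrix layout for labels 0 and 1,
[[TN, FP], [FN, TP]], with class 1 treated as the positive class;
recall_and_precision is a hypothetical helper, not a library function.

```python
# Confusion matrices copied from the results in this post,
# laid out as [[TN, FP], [FN, TP]] (rows = true class, columns = predicted).
reported = {
    "Method 1, iteration 1": [[998, 4], [42, 56]],
    "Method 1, iteration 5": [[1000, 3], [35, 62]],
    "Method 2, fold 1": [[995, 5], [92, 8]],
    "Method 2, fold 5": [[986, 14], [76, 23]],
}


def recall_and_precision(cm):
    """Return (recall, precision) for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    return tp / float(tp + fn), tp / float(tp + fp)


summary = {}
for name, cm in reported.items():
    summary[name] = recall_and_precision(cm)
    print("%-22s recall=%.2f precision=%.2f" % ((name,) + summary[name]))
```

On these numbers, class-1 recall is roughly 0.57 to 0.64 under Method 1 and
0.08 to 0.23 under Method 2, so most of the difference is in missed class-1
samples rather than in false alarms on class 0.
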