I wouldn't expect those splits to be the same in the first place: train_test_split draws a single random 80/20 split with no stratification, while StratifiedKFold partitions the data into five disjoint folds that preserve the class proportions. On top of that, you are seeding the randomness differently in the two cases (random_state=i and random_state=0 in method 1 versus random_state=None in method 2). Take a close look at the generated splits; their composition may already explain the discrepancies.
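A quick way to check is to count how many positives each test set actually receives. A rough sketch, assuming the 0.16-era cross_validation module and a 0/1 label vector shaped like yours (500 positives, 5000 negatives); the y_train below is a synthetic stand-in, not your data:

    import numpy as np
    from sklearn import cross_validation

    # Stand-in labels with the class balance from your post.
    y_train = np.array([1] * 500 + [0] * 5000)
    idx = np.arange(len(y_train))

    # Method 1 style: plain train_test_split does not stratify, so the
    # number of positives in the 20% test set varies with the seed.
    for i in range(5):
        tr, te = cross_validation.train_test_split(idx, test_size=0.2,
                                                   random_state=i)
        print "train_test_split seed %d: %d positives of %d" % (
            i, y_train[te].sum(), len(te))

    # Method 2 style: StratifiedKFold preserves the class ratio, so each
    # fold should hold almost exactly 100 positives out of 1100.
    cv = cross_validation.StratifiedKFold(y_train, n_folds=5)
    for k, (train, test) in enumerate(cv):
        print "StratifiedKFold fold %d: %d positives of %d" % (
            k, y_train[test].sum(), len(test))

If the positive counts per test set differ noticeably between the two loops, that alone can move the confusion matrices and AUC on a 1:10 imbalanced problem.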
On Tue, Sep 15, 2015 at 4:55 PM, Mamun Rashid <mamunbabu2...@gmail.com> wrote:
> I am seeing a discrepancy in classification performance between two
> cross-validation techniques on the same data. I was wondering if anyone
> can shed some light on this.
>
> Thanks in advance for your help.
>
> Mamun
>
> - Method 1: cross_validation.train_test_split
> - Method 2: StratifiedKFold
>
> Two examples with the same data set.
>
> Data set: 5500 samples [Class 1 = 500; Class 0 = 5000] by 193 features
>
> ==========================================================================================
> Method 1 [ random iterations with train_test_split ]
>
> for i in range(0, 5):
>     X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(
>         X_train.values, y_train, test_size=0.2, random_state=i)
>     clf = RandomForestClassifier(n_estimators=250, max_depth=None,
>                                  min_samples_split=1, random_state=0,
>                                  oob_score=True)
>     y_score = clf.fit(X_tr, y_tr).predict(X_te)
>     y_prob = clf.fit(X_tr, y_tr).predict_proba(X_te)
>     cm = confusion_matrix(y_te, y_score)
>     print cm
>     fpr, tpr, thresholds = roc_curve(y_te, y_prob[:, 1])
>     roc_auc = auc(fpr, tpr)
>     print "ROC AUC: ", roc_auc
>
> Result of method 1
>
> Iteration 1 ROC AUC: 0.91
> [[998   4]
>  [ 42  56]]
>
> Iteration 5 ROC AUC: 0.88
> [[1000    3]
>  [  35   62]]
>
> ==========================================================================================
> Method 2 [ StratifiedKFold cross-validation ]
>
> cv = StratifiedKFold(y_train, n_folds=5, random_state=None, shuffle=False)
> clf = RandomForestClassifier(n_estimators=250, max_depth=None,
>                              min_samples_split=1, random_state=None,
>                              oob_score=True)
> for train, test in cv:
>     y_score = clf.fit(X_train.values[train],
>                       y_train[train]).predict(X_train.values[test])
>     y_prob = clf.fit(X_train.values[train],
>                      y_train[train]).predict_proba(X_train.values[test])
>     cm = confusion_matrix(y_train[test], y_score)
>     print cm
>     fpr, tpr, thresholds = roc_curve(y_train[test], y_prob[:, 1])
>     roc_auc = auc(fpr, tpr)
>     print "ROC AUC: ", roc_auc
>
> Result of method 2
>
> Fold 1 ROC AUC: 0.76
> Fold 1 confusion matrix
> [[995   5]
>  [ 92   8]]
>
> Fold 5 ROC AUC: 0.77
> Fold 5 confusion matrix
> [[986  14]
>  [ 76  23]]
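PS: if you want the repeated random splits of method 1 to be directly comparable to method 2, StratifiedShuffleSplit gives you shuffled splits that keep the class balance. Same stand-in assumptions as the sketch above:

    import numpy as np
    from sklearn import cross_validation

    y_train = np.array([1] * 500 + [0] * 5000)  # stand-in labels

    # Five random 80/20 splits that preserve the 1:10 class ratio;
    # a drop-in replacement for the train_test_split loop.
    sss = cross_validation.StratifiedShuffleSplit(y_train, n_iter=5,
                                                  test_size=0.2,
                                                  random_state=0)
    for train, test in sss:
        print "positives in test: %d of %d" % (y_train[test].sum(), len(test))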