I wouldn't expect those splits to be the same in the first place:
train_test_split draws a single shuffled, unstratified split, while
StratifiedKFold with shuffle=False builds its folds in data order and
preserves the class proportions. On top of that, you are seeding the
randomness differently in the two cases: random_state=i for the split and
random_state=0 for the forest in Method 1, versus random_state=None for
both in Method 2. Take a close look at the generated splits - their
composition may already explain the discrepancies.
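
For example, something along these lines (a minimal sketch against the
0.16-era API your snippets use; X and y stand in for your X_train.values
and y_train, and y is assumed to be a 0/1 numpy array) prints the class
balance of each test split:

from sklearn.cross_validation import train_test_split, StratifiedKFold

# class balance of the unstratified random test splits
for i in range(5):
    _, _, _, y_te = train_test_split(X, y, test_size=0.2, random_state=i)
    print "train_test_split seed %d: %d positives / %d (%.3f)" % (
        i, y_te.sum(), len(y_te), y_te.mean())

# class balance of the stratified test folds
for k, (train, test) in enumerate(StratifiedKFold(y, n_folds=5)):
    print "StratifiedKFold fold %d: %d positives / %d (%.3f)" % (
        k, y[test].sum(), len(y[test]), y[test].mean())

With roughly 9% positives overall, the stratified folds keep that fraction
fixed, while the unstratified splits can drift above or below it. If you
want shuffled splits that also preserve the class balance,
sklearn.cross_validation.StratifiedShuffleSplit does exactly that.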

On Tue, Sep 15, 2015 at 4:55 PM, Mamun Rashid <mamunbabu2...@gmail.com>
wrote:

> I am seeing a discrepancy in classification performance between two
> cross-validation techniques using the same data. I was wondering if anyone
> can shed some light on this.
>
> Thanks in advance for your help.
>
> Mamun
>
>
>    - Method 1: cross_validation.train_test_split
>    - Method 2: StratifiedKFold.
>
> Two examples with the same data set
>
> Data set: 5500 samples [class 1 = 500; class 0 = 5000] by 193 features
>
> ==========================================================================================
> Method 1 [ Random Iteration with train_test_split ]
>
> from sklearn import cross_validation
> from sklearn.ensemble import RandomForestClassifier
> from sklearn.metrics import confusion_matrix, roc_curve, auc
>
> for i in range(5):
>     X_tr, X_te, y_tr, y_te = cross_validation.train_test_split(
>         X_train.values, y_train, test_size=0.2, random_state=i)
>     clf = RandomForestClassifier(n_estimators=250, max_depth=None,
>                                  min_samples_split=1, random_state=0,
>                                  oob_score=True)
>     # fit once and reuse the fitted forest for both outputs
>     clf.fit(X_tr, y_tr)
>     y_score = clf.predict(X_te)
>     y_prob = clf.predict_proba(X_te)
>     cm = confusion_matrix(y_te, y_score)
>     print cm
>     fpr, tpr, thresholds = roc_curve(y_te, y_prob[:, 1])
>     roc_auc = auc(fpr, tpr)
>     print "ROC AUC: ", roc_auc
>
> Results of Method 1
>
> Iteration 1 ROC AUC:  0.91
> [[998   4]
>  [ 42  56]]
>
> Iteration 5 ROC AUC:  0.88
> [[1000    3]
>  [  35   62]]
>
>
> ==========================================================================================
> Method 2 [ StratifiedKFold cross validation ]
>
> from sklearn.cross_validation import StratifiedKFold
>
> cv = StratifiedKFold(y_train, n_folds=5, random_state=None, shuffle=False)
> clf = RandomForestClassifier(n_estimators=250, max_depth=None,
>                              min_samples_split=1, random_state=None,
>                              oob_score=True)
> for train, test in cv:
>     # fit once per fold; with random_state=None a second fit would grow
>     # a different forest, so predict and predict_proba must share one fit
>     clf.fit(X_train.values[train], y_train[train])
>     y_score = clf.predict(X_train.values[test])
>     y_prob = clf.predict_proba(X_train.values[test])
>     cm = confusion_matrix(y_train[test], y_score)
>     print cm
>     fpr, tpr, thresholds = roc_curve(y_train[test], y_prob[:, 1])
>     roc_auc = auc(fpr, tpr)
>     print "ROC AUC: ", roc_auc
>
> Results of Method 2
>
> Fold 1 ROC AUC:  0.76
> Fold 1 Confusion Matrix
> [[995   5]
>  [ 92   8]]
>
> Fold 5 ROC AUC:  0.77
> Fold 5 Confusion Matrix
> [[986  14]
>  [ 76  23]]
>
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
