Hi,
I am trying to get a better understanding of cross-validation in scikit-learn, e.g. KFold.
I am building binary SVC classifiers and predicting probabilities from multiple
classifiers. Standard stuff: one-vs-rest (OvR) with SVC and the predict_proba
methodology.
So, let's say I have 9 classifiers and K=3. When training is done, I should
have 9 saved (pickled) classifiers, and each classifier, when scored, should
return an average probability over the K=3 runs. That is, I provide a feature
vector, call clf.predict_proba for each clf, and I have my probability.
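To be concrete, what I picture at prediction time is roughly the sketch below.
The name fold_clfs is made up: it is assumed to hold, for each label, the list
of K=3 classifiers fitted in the CV loop (my actual code further down just
keeps whatever the last fit was).

# Sketch only: fold_clfs[label] is assumed to be a list of the K=3 binary
# classifiers fitted for that label (made-up name, not in my actual code).
def averaged_proba(fold_clfs, x):
    """Return {label: mean P(label) over the K fold classifiers}
    for a single sample x of shape (1, n_features)."""
    scores = {}
    for label, clfs in fold_clfs.iteritems():
        # column 1 of predict_proba is the positive (1) class from np.where
        probs = [clf.predict_proba(x)[0, 1] for clf in clfs]
        scores[label] = np.mean(probs)
    return scores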
Well, when I run my loop I print the classification report for each fold, so I
end up with three reports. A typical one is below, and they look pretty much
the same for each fold.
             precision    recall  f1-score   support

          1       0.94      0.95      0.95       982
          2       0.89      0.89      0.89       560
          3       0.86      0.86      0.86       874
          4       0.88      0.90      0.89       883
          5       0.88      0.86      0.87       168
          6       0.89      0.87      0.88       249
          7       0.95      0.91      0.93       119
          8       0.97      0.93      0.95       180
          9       0.97      0.97      0.97       154

avg / total       0.90      0.90      0.90      4169
Now, prior to training, I split the sample into training and test sets as
below.

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=test_size, random_state=1)
For my CV loop I have code like the below. All I am doing is running each fold
and, within each fold, looping over each binary classifier. Note that y is
multiclass, and on each iteration I convert it to a binary target in the spirit
of OvR. I then use the fold's test split to generate classification reports
that look like the one above.
cv = KFold(y_train.shape[0], 3, random_state=seed)

# one binary (one-vs-rest) classifier per label
for label in label_distribution.keys():
    clf_handlers[label] = svm.SVC(kernel="linear", C=C, verbose=True,
                                  probability=True, class_weight=None)

for i, (train, test) in enumerate(cv):
    yp_handler = {}
    for label in label_distribution.keys():
        # binarize the multiclass target for this label (OvR)
        v = np.where(y_train == label, 1, 0)
        clf = clf_handlers[label]
        clf.fit(X_train[train], v[train])
        yp_handler[label] = clf.predict_proba(X_train[test])
        clf_handlers[label] = clf
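Inside that fold loop I then turn yp_handler into predicted labels and print
the per-fold report mentioned above, roughly like this (sketch; same
max-probability idea as in my scoring loop further down):

    # Sketch: per-fold report from yp_handler (each value is predict_proba
    # on X_train[test] for that label's binary classifier).
    fold_labels = sorted(yp_handler.keys())
    # column 1 = probability of the positive (1) class
    probs = np.column_stack([yp_handler[l][:, 1] for l in fold_labels])
    fold_pred = np.array(fold_labels)[probs.argmax(axis=1)]
    print classification_report(y_train[test], fold_pred)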
The problem comes when scoring the test set created by train_test_split.
I pass in the X_test feature matrix and, against the observed y_test labels,
iterate through each row and choose the label whose classifier returns the
highest probability. This is the same thing I do for each iteration of the CV
loop. So I am having trouble understanding the reports below, which do not
look good at all.
Each clf for each label returns a probability, and then I just choose the max,
right?
So... is there any issue with my methodology that would yield the results below?
[[ 0 0 786 14 0 0 0 0 0]
[ 0 0 476 0 0 0 0 0 0]
[ 0 0 735 1 0 0 0 0 0]
[ 0 0 787 20 0 0 0 0 0]
[ 0 0 168 0 0 0 0 0 0]
[ 0 0 221 1 0 1 0 0 0]
[ 0 0 109 0 0 0 2 0 0]
[ 0 0 139 1 0 0 0 3 0]
[ 0 0 110 1 0 0 0 0 0]]
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       800
          2       0.00      0.00      0.00       476
          3       0.21      1.00      0.34       736
          4       0.53      0.02      0.05       807
          5       0.00      0.00      0.00       168
          6       1.00      0.00      0.01       223
          7       1.00      0.02      0.04       111
          8       1.00      0.02      0.04       143
          9       0.00      0.00      0.00       111

avg / total       0.30      0.21      0.08      3575
# My loop for scoring the final test dataset.
# Each fitted classifier is stored in a dict with the label as key.
y_predicted = zeros(y_test.shape[0])
yp_handler = {}
for label, clf in self.clf_handler.iteritems():
    yp_handler[label] = clf.predict_proba(X_test)

for i in xrange(y_test.shape[0]):
    label_predictions = {}
    for label, predicted in yp_handler.iteritems():
        # column 1 = probability of the positive class for that label
        label_predictions[label] = predicted[i][1]
    # pick the label with the highest probability
    max_label_predictions = max(label_predictions,
                                key=label_predictions.get)
    y_predicted[i] = max_label_predictions

print confusion_matrix(y_test, y_predicted)
print classification_report(y_test, y_predicted)