Hi David.
I didn't look at your code in detail, but there are several tools in sklearn that could
help you simplify your setup and maybe get rid of your problem.
Is there any reason to use one-vs-rest instead of one-vs-one?
SVC has one-vs-one built in, so you could just use that and not fiddle with the labels yourself. If you prefer one-vs-rest, you can use the OneVsRestClassifier <http://scikit-learn.org/dev/modules/classes.html#module-sklearn.multiclass> together with SVC,
which will do the work for you.
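
For example (a minimal sketch, untested; the C value and the array names are assumptions based on the snippets below):

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# SVC alone already handles a multiclass y via one-vs-one:
ovo_clf = SVC(kernel="linear", C=1.0, probability=True)
ovo_clf.fit(X_train, y_train)

# For one-vs-rest, wrap the same estimator; the label binarization and
# the per-class classifiers are handled for you:
ovr_clf = OneVsRestClassifier(SVC(kernel="linear", C=1.0, probability=True))
ovr_clf.fit(X_train, y_train)
y_pred = ovr_clf.predict(X_test)  # already picks the best class per row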

Hope that helps,
Andy

On 08/26/2012 02:22 AM, David Montgomery wrote:
Hi,

I am trying to get a better understanding of CV in scikit-learn, e.g. KFold.

I am building a binary SVC classifier for each class and predicting 
probabilities with each.  Standard stuff: OvR with SVC and the predict_proba 
methodology.

So, let's say I have 9 classifiers and K=3.  When done with training, I should 
have 9 saved pickled classifiers, and each classifier, when scored, will return 
an average probability over the K=3 runs.  That is, I provide a feature vector, 
call clf.predict_proba for each clf, and I have my probability.
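
In code, the averaging I have in mind is something like this (a sketch; 
fold_probs is a hypothetical list holding one classifier's K predict_proba 
outputs on the same rows):

import numpy as np

# mean positive-class probability across the K fold models
avg_proba = np.mean([p[:, 1] for p in fold_probs], axis=0)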


Well, when I run my loop, I print the classification report for each K 
iteration, so I end up with three reports.
A typical output is below, and the reports look pretty much the same for each 
iteration.


              precision    recall  f1-score   support

           1       0.94      0.95      0.95       982
           2       0.89      0.89      0.89       560
           3       0.86      0.86      0.86       874
           4       0.88      0.90      0.89       883
           5       0.88      0.86      0.87       168
           6       0.89      0.87      0.88       249
           7       0.95      0.91      0.93       119
           8       0.97      0.93      0.95       180
           9       0.97      0.97      0.97       154

avg / total       0.90      0.90      0.90      4169


Now, prior to training, I split the sample into training and testing sets 
using the code below.

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=test_size, random_state=1)


For my CV loop I have code like the below.  All I am doing is running each fold 
and, within each fold, looping through each binary classifier.  As a note, y is 
multiclass, and on each iteration I convert it to binary in the spirit of OvR.  
I then use the test set to generate classification reports that look like the 
report above.

cv = KFold(y_train.shape[0], 3, random_state=seed)
for label in label_distribution.keys():
    clf_handlers[label] = svm.SVC(kernel="linear", C=C, verbose=True,
                                  probability=True, class_weight=None)
for i, (train, test) in enumerate(cv):
    yp_handler = {}
    for label in label_distribution.keys():
        v = np.where(y_train == label, 1, 0)  # binarize labels for OvR
        clf = clf_handlers[label]
        clf.fit(X_train[train], v[train])
        yp_handler[label] = clf.predict_proba(X_train[test])
        clf_handlers[label] = clf
The problem I have comes when scoring the test set created by train_test_split.

I pass in the X_test feature vectors and, given the observed y_test labels, 
iterate through each row and choose the classifier label that has the
highest probability.  This is the same thing I do for each iteration of the CV 
loop.  So, I am having an issue understanding the reports below, which do not 
look good.

So, each clf for each label will return a probability?  Then I just choose the 
max prob?  Right?

So... are there any issues with my methodology that would yield the results 
below?

[[  0   0 786  14   0   0   0   0   0]
  [  0   0 476   0   0   0   0   0   0]
  [  0   0 735   1   0   0   0   0   0]
  [  0   0 787  20   0   0   0   0   0]
  [  0   0 168   0   0   0   0   0   0]
  [  0   0 221   1   0   1   0   0   0]
  [  0   0 109   0   0   0   2   0   0]
  [  0   0 139   1   0   0   0   3   0]
  [  0   0 110   1   0   0   0   0   0]]
              precision    recall  f1-score   support

           1       0.00      0.00      0.00       800
           2       0.00      0.00      0.00       476
           3       0.21      1.00      0.34       736
           4       0.53      0.02      0.05       807
           5       0.00      0.00      0.00       168
           6       1.00      0.00      0.01       223
           7       1.00      0.02      0.04       111
           8       1.00      0.02      0.04       143
           9       0.00      0.00      0.00       111

avg / total       0.30      0.21      0.08      3575





# My loop for scoring the final test dataset.
# Each classifier was loaded into a dict, with the label as key and the
# fitted classifier as value.
y_predicted = zeros(y_test.shape[0])
yp_handler = {}
for label, clf in self.clf_handler.iteritems():
    yp_handler[label] = clf.predict_proba(X_test)
for i in xrange(y_test.shape[0]):
    label_predictions = {}
    for label, predicted in yp_handler.iteritems():
        # probability of the positive (one-vs-rest) class
        label_predictions[label] = predicted[i][1]
    max_label_predictions = max(label_predictions,
                                key=label_predictions.get)
    y_predicted[i] = max_label_predictions
print confusion_matrix(y_test, y_predicted)
print classification_report(y_test, y_predicted)
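
As an aside, that per-row loop can be written as a single argmax over the 
stacked probability columns (a sketch, assuming yp_handler as above):

import numpy as np

labels = sorted(yp_handler.keys())
# column j holds P(class == labels[j]) for each test row
probs = np.column_stack([yp_handler[label][:, 1] for label in labels])
y_predicted = np.asarray(labels)[probs.argmax(axis=1)]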




Sent from my iPad