Hi,
I am trying to get a better understanding of cross-validation in scikit-learn, e.g. KFold.
I am building binary SVC classifiers and predicting probabilities from multiple
classifiers. Standard stuff: one-vs-rest (OvR) with SVC and the predict_proba
methodology.
So, let's say I have 9 classifiers and K=3. When training is done, I should
have 9 saved (pickled) classifiers, and each classifier, when scored, should
return an average probability over the K=3 runs. That is, I provide a feature
vector, call clf.predict_proba for each clf, and I have my probability.
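To be concrete, what I picture at prediction time is roughly the sketch below.
The name fold_clfs is made up: it is assumed to hold, for each label, the list
of K=3 classifiers fitted in the CV loop (my actual code further down just
keeps whatever the last fit was).

# Sketch only: fold_clfs[label] is assumed to be a list of the K=3 binary
# classifiers fitted for that label (made-up name, not in my actual code).
def averaged_proba(fold_clfs, x):
    """Return {label: mean P(label) over the K fold classifiers}
    for a single sample x of shape (1, n_features)."""
    scores = {}
    for label, clfs in fold_clfs.iteritems():
        # column 1 of predict_proba is the positive (1) class from np.where
        probs = [clf.predict_proba(x)[0, 1] for clf in clfs]
        scores[label] = np.mean(probs)
    return scores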
Well, when I run my loop I print the classification report for each fold, so I
end up with three reports. A typical one is below, and they look pretty much
the same for each fold.
             precision    recall  f1-score   support

          1       0.94      0.95      0.95       982
          2       0.89      0.89      0.89       560
          3       0.86      0.86      0.86       874
          4       0.88      0.90      0.89       883
          5       0.88      0.86      0.87       168
          6       0.89      0.87      0.88       249
          7       0.95      0.91      0.93       119
          8       0.97      0.93      0.95       180
          9       0.97      0.97      0.97       154

avg / total       0.90      0.90      0.90      4169
Now, prior to training, I split the sample into training and test sets as
below.

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y, test_size=test_size, random_state=1)
For my CV loop I have code like the below. All I am doing is running each fold
and, within each fold, looping over each binary classifier. Note that y is
multiclass, and on each iteration I convert it to a binary target in the spirit
of OvR. I then use the fold's test split to generate classification reports
that look like the one above.
cv = KFold(y_train.shape[0], 3, random_state=seed)

# one binary (one-vs-rest) classifier per label
for label in label_distribution.keys():
    clf_handlers[label] = svm.SVC(kernel="linear", C=C, verbose=True,
                                  probability=True, class_weight=None)

for i, (train, test) in enumerate(cv):
    yp_handler = {}
    for label in label_distribution.keys():
        # binarize the multiclass target for this label (OvR)
        v = np.where(y_train == label, 1, 0)
        clf = clf_handlers[label]
        clf.fit(X_train[train], v[train])
        yp_handler[label] = clf.predict_proba(X_train[test])
        clf_handlers[label] = clf
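Inside that fold loop I then turn yp_handler into predicted labels and print
the per-fold report mentioned above, roughly like this (sketch; same
max-probability idea as in my scoring loop further down):

    # Sketch: per-fold report from yp_handler (each value is predict_proba
    # on X_train[test] for that label's binary classifier).
    fold_labels = sorted(yp_handler.keys())
    # column 1 = probability of the positive (1) class
    probs = np.column_stack([yp_handler[l][:, 1] for l in fold_labels])
    fold_pred = np.array(fold_labels)[probs.argmax(axis=1)]
    print classification_report(y_train[test], fold_pred)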
The problem comes when scoring the test set created by train_test_split.
I pass in the X_test feature matrix and, against the observed y_test labels,
iterate through each row and choose the label whose classifier returns the
highest probability. This is the same thing I do for each iteration of the CV
loop. So I am having trouble understanding the reports below, which do not
look good at all.
Each clf for each label returns a probability, and then I just choose the max,
right?
So... is there any issue with my methodology that would yield the results below?
[[ 0 0 786 14 0 0 0 0 0]
[ 0 0 476 0 0 0 0 0 0]
[ 0 0 735 1 0 0 0 0 0]
[ 0 0 787 20 0 0 0 0 0]
[ 0 0 168 0 0 0 0 0 0]
[ 0 0 221 1 0 1 0 0 0]
[ 0 0 109 0 0 0 2 0 0]
[ 0 0 139 1 0 0 0 3 0]
[ 0 0 110 1 0 0 0 0 0]]
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       800
          2       0.00      0.00      0.00       476
          3       0.21      1.00      0.34       736
          4       0.53      0.02      0.05       807
          5       0.00      0.00      0.00       168
          6       1.00      0.00      0.01       223
          7       1.00      0.02      0.04       111
          8       1.00      0.02      0.04       143
          9       0.00      0.00      0.00       111

avg / total       0.30      0.21      0.08      3575
# My loop for scoring the final test dataset.
# Each fitted classifier is stored in a dict with the label as key.
y_predicted = zeros(y_test.shape[0])
yp_handler = {}
for label, clf in self.clf_handler.iteritems():
    yp_handler[label] = clf.predict_proba(X_test)

for i in xrange(y_test.shape[0]):
    label_predictions = {}
    for label, predicted in yp_handler.iteritems():
        # column 1 = probability of the positive class for that label
        label_predictions[label] = predicted[i][1]
    # pick the label with the highest probability
    max_label_predictions = max(label_predictions,
                                key=label_predictions.get)
    y_predicted[i] = max_label_predictions

print confusion_matrix(y_test, y_predicted)
print classification_report(y_test, y_predicted)