Hello there,
correct me if I am wrong, but I couldn't find a method for calculating the
average confusion matrix from a k-fold validation routine.
The average confusion matrix is quite handy and is used in many scientific
papers ... well, at least in the ones I read!
So I wrote a couple of functions that might be useful for other users (I also
saw some questions about this on Stack Overflow).
There are things that I don't know how to implement correctly, like finding out
how many classes there are in total,
and what happens if the matrix has a different dimension because some
of the classes are not present in a given fold.
Hopefully somebody can contribute!
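
On the two open points above, one possible approach is to take the sorted unique labels of the full label vector once, before splitting, and to count every fold against that fixed list. Each per-fold matrix then has the same shape even when a class is missing from a fold. The sketch below shows the idea; `fixed_confusion` and the sample data are hypothetical, written only for illustration, and are not part of scikit-learn. `sklearn.metrics.confusion_matrix` takes a `labels` argument that serves the same purpose.

```python
import numpy

def fixed_confusion(y_true, y_pred, labels):
    """Confusion matrix with one row/column per entry of `labels`.

    Hypothetical helper for illustration: rows are true labels, columns
    are predicted labels, in the order given by `labels`.
    """
    index = {label: k for k, label in enumerate(labels)}
    cm = numpy.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

# Fix the label set once from the full label vector (made-up data)
Y = numpy.array([0, 1, 2, 0, 1, 2, 0, 1])
labels = numpy.unique(Y).tolist()   # [0, 1, 2]

# A fold in which class 2 never occurs still gives a 3x3 matrix
fold_true = [0, 1, 0, 1]
fold_pred = [0, 1, 1, 1]
cm = fixed_confusion(fold_true, fold_pred, labels)
print(cm.shape)   # (3, 3)
```

Because the shape no longer depends on which classes show up in a fold, the per-fold matrices can be summed directly.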

Example (X and Y are numpy arrays):
clf = RandomForestClassifier(n_estimators=10)
total_classes = list(set(Y))
kfolder_confusion(clf, X, Y, total_classes=len(total_classes), n_folds=10)


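import numpy
import pylab as pl
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

def average_matrix(cm):
    """ Normalize a summed confusion matrix: each cell is divided by the
    total count in its row and column (the cell itself counted once) """
    cm = numpy.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1)[:, numpy.newaxis]
    col_sums = cm.sum(axis=0)[numpy.newaxis, :]
    denom = row_sums + col_sums - cm
    # Cells whose row and column are both empty stay 0 instead of nan
    result = numpy.zeros(cm.shape)
    numpy.divide(cm, denom, out=result, where=denom > 0)
    return result

def kfolder_confusion(clf,corpus,label_features,total_classes,n_folds=10):
    """ Do a K fold validation and compute the average confusion matrix """
    kf = KFold(len(label_features), n_folds=n_folds, indices=False)
    # Fix the label set from the full label vector, so that every fold
    # gives a matrix of the same shape even if a class is absent from it;
    # total_classes must equal len(labels)
    labels = numpy.unique(label_features)
    # initialize an empty confusion matrix
    partial_sum = numpy.zeros((total_classes, total_classes))

    for train, test in kf:
        # train and test are boolean masks over the samples
        train_data = corpus[train]
        test_data = corpus[test]
        train_label = label_features[train]
        test_label = label_features[test]
        clf.fit(train_data, train_label)
        label_pred = clf.predict(test_data)
        # Compute confusion matrix for each fold
        cm = confusion_matrix(test_label, label_pred, labels=labels)
        # Keep the running sum over the folds
        partial_sum = numpy.add(cm, partial_sum)

    average_cm = average_matrix(partial_sum)

    # Show the average confusion matrix
    pl.matshow(average_cm)
    pl.title('Average confusion matrix')
    pl.colorbar()
    pl.ylabel('True label')
    pl.xlabel('Predicted label')
    pl.show()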
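
The `KFold(n, n_folds=..., indices=False)` call above follows an older scikit-learn API. In later releases `KFold` lives in `sklearn.model_selection`, takes `n_splits`, and yields integer index arrays from its `split()` method. A minimal sketch of the same loop under that API might look like the following; `kfold_confusion_sum`, the shuffling, and the toy data are assumptions made for illustration, and the function returns the summed matrix rather than plotting it.

```python
import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

def kfold_confusion_sum(clf, X, y, n_folds=10):
    """Sum the per-fold confusion matrices over a k-fold split.

    Hypothetical helper: the label set is fixed from the full `y`, so
    every fold contributes a matrix of the same shape.
    """
    labels = numpy.unique(y)
    total = numpy.zeros((len(labels), len(labels)), dtype=int)
    # Shuffling is an assumption here; the original loop did not shuffle
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        total += confusion_matrix(y[test_idx], pred, labels=labels)
    return total

# Toy data, made up for the example: 60 samples, 3 classes
rng = numpy.random.RandomState(0)
X = rng.rand(60, 4)
y = numpy.repeat([0, 1, 2], 20)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
total = kfold_confusion_sum(clf, X, y, n_folds=5)
print(total.shape)   # (3, 3)
print(total.sum())   # 60, since every sample is tested exactly once
```

The summed matrix can then be normalized and plotted as in the code above.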
