Hi all,

I would appreciate it if a couple of maintainers could take a look at my pull request (https://github.com/scikit-learn/scikit-learn/pull/8190), which implements the Complement Naive Bayes (CNB) classifier described in Rennie et al. (2003). CNB regularly outperforms the standard Multinomial Naive Bayes (MNB) classifier on real-world data sets, which tend to suffer from class imbalance. Apache Mahout offers its own implementation of CNB alongside MNB, but it would be nice to have an easy-to-use CNB implementation available in scikit-learn.

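For anyone who has not read the paper, the core idea can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code in the pull request. It assumes raw count vectors and additive smoothing through a hypothetical `alpha` parameter, and it omits the TF, IDF, length-normalization and weight-normalization steps that Rennie et al. also describe. The names `fit_cnb` and `predict_cnb` and the toy data are made up for the example.

```python
import math


def fit_cnb(X, y, alpha=1.0):
    """Estimate complement log-weights from count vectors X and labels y.

    Illustrative sketch only; `alpha` is an assumed smoothing parameter.
    """
    n_features = len(X[0])
    weights = {}
    for c in sorted(set(y)):
        # Sum feature counts over every document that is NOT in class c.
        comp = [0.0] * n_features
        for row, label in zip(X, y):
            if label != c:
                for i, count in enumerate(row):
                    comp[i] += count
        # Smoothed complement estimate, stored as log-probabilities.
        total = sum(comp) + alpha * n_features
        weights[c] = [math.log((v + alpha) / total) for v in comp]
    return weights


def predict_cnb(weights, row):
    """Pick the class whose complement explains the document worst."""
    scores = {
        c: sum(f * w for f, w in zip(row, ws)) for c, ws in weights.items()
    }
    return min(scores, key=scores.get)


# Toy, deliberately imbalanced data: three "a" documents, one "b" document.
X = [[3, 0, 1], [2, 0, 0], [4, 1, 0], [0, 3, 1]]
y = ["a", "a", "a", "b"]

weights = fit_cnb(X, y)
print(predict_cnb(weights, [0, 2, 0]))
print(predict_cnb(weights, [2, 0, 0]))
```

Because each class's parameters are estimated from all of the other classes' documents, a rare class such as "b" above is scored against a much larger pool of counts than it would be under MNB, which is the intuition behind CNB's robustness to imbalance.
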
Training the CNB classifier on a reasonably sized data set of 493,038 documents, with a median length of 87 tokens and 1,155,784 distinct tokens, took around 8.5 seconds. For comparison, the MNB classifier took around 4.5 seconds to train, but CNB had a 10% lower error rate, which seems like a worthwhile tradeoff.

Happy to answer any questions or discuss further.

Thanks,
Michael A. Alcorn
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
