Hi, I have been working for a little while with Mahout and the Bayesian classifier for a school project.
I am using the Enron email corpus and the UC Berkeley classified emails (http://www.cs.cmu.edu/~enron/). I ran a few tests and I can't seem to make it work, so I wonder if I am doing something wrong. For example, I am getting under 10% correct predictions with Bayes and around 1% with CBayes.

The problem seems to be that all instances of a class get predicted as another class, or that they all get predicted as the class containing the most features. I also tested with the 20News corpus and I get similar results, where all instances of a class are predicted as another class (e.g. all 421 "rec.motorcycles" get predicted as "talk.politics.mideast").

Attached are two confusion matrices displaying the results for Bayes and CBayes. Both used the same split into training and testing sets.

Am I doing something wrong?

Thanks,
Philippe Lamarche
