Now, with the attachment. Sorry.

On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have been working for a little while with Mahout and the Bayesian
> classifier for a school project.
>
> I am using the Enron email corpus and the UC Berkeley classified
> emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I can't
> seem to make it work. I wonder if I am doing something wrong.
>
> For example, I am getting under 10% correct predictions with Bayes and
> around 1% with CBayes. The problem seems to be that all instances of a
> class are predicted as another class, or that they are all predicted
> as the class containing the most features.
>
> I also tested with the 20News corpus and I get similar results, where
> all instances of a class are predicted as another class (e.g. all
> 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
> Attached are two confusion matrices displaying results for Bayes and
> CBayes. Both used the same division into training and testing sets.
>
> Am I doing something wrong?
>
> Thanks,
>
> Philippe Lamarche.
>
=-=-=-=-=-=-=-=-=-=This is for 20News CBayes=-=-=-=-=-=-=-=-=-=
| | alt.atheism | comp.graphics | comp.os.ms-windows.misc | comp.sys.ibm.pc.hardware | comp.sys.mac.hardware | comp.windows.x | misc.forsale | rec.autos | rec.motorcycles | rec.sport.baseball | rec.sport.hockey | sci.crypt | sci.electronics | sci.med | sci.space | soc.religion.christian | talk.politics.guns | talk.politics.mideast | talk.politics.misc | talk.religion.misc |
| alt.atheism | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 214 | 0 | 0 |
| comp.graphics | 0 | 421 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| comp.os.ms-windows.misc | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| comp.sys.ibm.pc.hardware | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| comp.sys.mac.hardware | 0 | 0 | 0 | 0 | 421 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| comp.windows.x | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| misc.forsale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 121 | 0 | 0 | 0 | 0 | 0 |
| rec.autos | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| rec.motorcycles | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 |
| rec.sport.baseball | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 85 | 0 | 0 |
| rec.sport.hockey | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 57 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sci.crypt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sci.electronics | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 192 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sci.med | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 | 0 | 0 | 0 | 0 |
| sci.space | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 158 | 0 | 0 | 0 | 0 | 0 |
| soc.religion.christian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 | 0 | 0 |
| talk.politics.guns | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 |
| talk.politics.mideast | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 421 | 0 | 0 |
| talk.politics.misc | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 354 | 0 |
| talk.religion.misc | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 207 | 0 |
Correctly classified 0.6411490683229814
4129/6440

=-=-=-=-=-=-=-=-=-=This is for 20News CBayes=-=-=-=-=-=-=-=-=-=
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   :     57    0.8851%
Incorrectly Classified Instances :   6383   99.1149%
Total Classified Instances       :   6440

=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   <--Classified as
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  a = rec.motorcycles
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  b = comp.windows.x
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  c = talk.politics.mideast
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  d = talk.politics.guns
   0   0   0   0   0   0   0 207   0   0   0   0   0   0   0   0   0   0   0   0 |  207  e = talk.religion.misc
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  f = rec.autos
   0   0   0   0   0   0   0  85   0   0   0   0   0   0   0   0   0   0   0   0 |   85  g = rec.sport.baseball
   0   0   0   0   0   0   0  57   0   0   0   0   0   0   0   0   0   0   0   0 |   57  h = rec.sport.hockey
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  i = comp.sys.mac.hardware
   0   0   0   0   0   0   0 158   0   0   0   0   0   0   0   0   0   0   0   0 |  158  j = sci.space
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  k = comp.sys.ibm.pc.hardware
   0   0   0   0   0   0   0 354   0   0   0   0   0   0   0   0   0   0   0   0 |  354  l = talk.politics.misc
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  m = comp.graphics
   0   0   0   0   0   0   0 192   0   0   0   0   0   0   0   0   0   0   0   0 |  192  n = sci.electronics
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  o = soc.religion.christian
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  p = sci.med
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  q = sci.crypt
   0   0   0   0   0   0   0 214   0   0   0   0   0   0   0   0   0   0   0   0 |  214  r = alt.atheism
   0   0   0   0   0   0   0 121   0   0   0   0   0   0   0   0   0   0   0   0 |  121  s = misc.forsale
   0   0   0   0   0   0   0 421   0   0   0   0   0   0   0   0   0   0   0   0 |  421  t = comp.os.ms-windows.misc
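
As a cross-check on the two accuracy figures, here is a minimal Python sketch that recomputes them from the counts in the matrices. It assumes rows are actual classes and columns are predicted classes, which matches the rec.motorcycles example in the message. The names `accuracy`, `predicted_labels`, `CLASS_SIZES` and `MISROUTED` are illustrative only and are not part of Mahout.

```python
def accuracy(confusion):
    """Return (correct, total, fraction) for confusion[actual][predicted] counts."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(actual, 0) for actual, row in confusion.items())
    return correct, total, correct / total


def predicted_labels(confusion):
    """Return the set of labels that received at least one prediction."""
    return {pred for row in confusion.values() for pred, n in row.items() if n > 0}


# Test-set size per class, as listed in both matrices.
CLASS_SIZES = {
    "alt.atheism": 214, "comp.graphics": 421, "comp.os.ms-windows.misc": 421,
    "comp.sys.ibm.pc.hardware": 421, "comp.sys.mac.hardware": 421,
    "comp.windows.x": 421, "misc.forsale": 121, "rec.autos": 421,
    "rec.motorcycles": 421, "rec.sport.baseball": 85, "rec.sport.hockey": 57,
    "sci.crypt": 421, "sci.electronics": 192, "sci.med": 421, "sci.space": 158,
    "soc.religion.christian": 421, "talk.politics.guns": 421,
    "talk.politics.mideast": 421, "talk.politics.misc": 354,
    "talk.religion.misc": 207,
}

# First matrix: classes whose instances were all assigned to a different label.
MISROUTED = {
    "alt.atheism": "talk.politics.mideast",
    "comp.os.ms-windows.misc": "sci.crypt",
    "comp.sys.ibm.pc.hardware": "sci.crypt",
    "misc.forsale": "sci.space",
    "rec.motorcycles": "talk.politics.mideast",
    "rec.sport.baseball": "talk.politics.mideast",
    "talk.politics.guns": "talk.politics.misc",
    "talk.religion.misc": "talk.politics.misc",
}

# Sparse matrices: every class maps to exactly one predicted label in both runs.
first = {c: {MISROUTED.get(c, c): n} for c, n in CLASS_SIZES.items()}
second = {c: {"rec.sport.hockey": n} for c, n in CLASS_SIZES.items()}

if __name__ == "__main__":
    for name, matrix in (("first", first), ("second", second)):
        correct, total, frac = accuracy(matrix)
        print(f"{name}: {correct}/{total} = {frac:.4%}, "
              f"{len(predicted_labels(matrix))} distinct predicted labels")
```

The diagonal sums come out to 4129/6440 for the first matrix and 57/6440 for the second, in line with the reported figures. In the second matrix every instance is assigned to a single label (rec.sport.hockey).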
