I think it would also be useful to cross-check your results against a text classification system that is known to work. Look at Rainbow:

http://www.cs.cmu.edu/~mccallum/bow/rainbow/

If you get correct results with Rainbow on the same data, then either you have made a mistake in how you are using Mahout, or there really is a bug.

Miles

2008/7/20 Robin Anil <[EMAIL PROTECTED]>:
> Can you upload your split somewhere?
>
> On Sun, Jul 20, 2008 at 6:46 AM, Philippe Lamarche
> <[EMAIL PROTECTED]> wrote:
>
> > Now, with the attachment.
> > Sorry.
> >
> > On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche
> > <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I have been working for a little while with Mahout and the Bayesian
> > > classifier for a school project.
> > >
> > > I am using the Enron email corpus and the UC Berkeley classified
> > > emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I
> > > can't seem to make it work. I wonder if I am doing something wrong.
> > >
> > > For example, I am getting under 10% correct predictions with Bayes
> > > and around 1% with CBayes. The problem seems to lie in the fact that
> > > all instances of a class will be predicted as another class, or that
> > > they will all be predicted as the class containing the most features.
> > >
> > > I also tested with the 20News corpus and I get similar results, where
> > > all instances of a class are predicted as another class (e.g. all
> > > 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
> > > Attached are two confusion matrices displaying results for Bayes and
> > > CBayes. Both used the same division into training and testing sets.
> > >
> > > Am I doing something wrong?
> > >
> > > Thanks,
> > >
> > > Philippe Lamarche.
>
> Thanks
> Robin

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
