Hi, I just tried it with Mallet.
http://mallet.cs.umass.edu/index.php/Main_Page

I used the same training and testing files (on the 20News corpus) and got 85% prediction accuracy. However, I also tried Mallet on my usual Enron corpus and only got 50% accuracy.

I would say there is probably something wrong with the Mahout classifier implementation, and also that the training data I use with the Enron data set is probably not distinct enough to be used with a Bayesian classifier.

Any ideas?

Thanks,
Philippe.

On Sun, Jul 20, 2008 at 11:23 AM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I uploaded my split here:
>
> http://www.2shared.com/file/3624998/e9330a64/news-train-testtar.html
>
> (the download link is after all the ads, at the bottom of the page)
>
> The file contains the "news_test_1" and "news_train_1" folders, with
> the original file/folder structure. The "news_ha_train_1" folder
> contains the collapsed version of "news_train_1".
>
> The training files are not evenly distributed across the classes (some
> classes contain fewer training files than others). This was done to
> reflect the UC Berkeley Enron corpus.
>
> Thanks,
> Philippe.
>
>
> On Sun, Jul 20, 2008 at 10:08 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> I haven't done a lot of testing w/ M-9 yet, so it is more than likely there
>> are bugs ;-)
>>
>> -Grant
>>
>> On Jul 20, 2008, at 6:21 AM, Miles Osborne wrote:
>>
>>> i think it would also be useful to cross-check your results against a text
>>> classification system which is known to work. look at rainbow:
>>>
>>> http://www.cs.cmu.edu/~mccallum/bow/rainbow/
>>>
>>> if you get the correct results here then either you have somehow messed-up
>>> with Mahout or else there really is a bug
>>>
>>> Miles
>>>
>>> 2008/7/20 Robin Anil <[EMAIL PROTECTED]>:
>>>
>>>> Can you upload your split somewhere?
>>>>
>>>> On Sun, Jul 20, 2008 at 6:46 AM, Philippe Lamarche <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Now, with the attachment.
>>>>> Sorry.
>>>>>
>>>>> On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have been working for a little while with Mahout and the Bayesian
>>>>>> classifier for a school project.
>>>>>>
>>>>>> I am using the Enron email corpus and the UC Berkeley classified
>>>>>> emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I can't
>>>>>> seem to make it work. I wonder if I am doing something wrong.
>>>>>>
>>>>>> For example, I am getting under 10% correct predictions with Bayes and
>>>>>> around 1% with CBayes. The problem seems to lie in the fact that all
>>>>>> instances of a class will be predicted as another class, or that they
>>>>>> will all be predicted as the class containing the most features.
>>>>>>
>>>>>> I also tested with the 20News corpus and I get similar results, where
>>>>>> all instances of a class will be predicted as another class (e.g. all
>>>>>> 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
>>>>>> Attached are two confusion matrices displaying results for Bayes and
>>>>>> CBayes. Both used the same division into training and testing sets.
>>>>>>
>>>>>> Am I doing something wrong?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Philippe Lamarche.
>>>>>>
>>>>>
>>>>
>>>> Thanks
>>>> Robin
>>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in Scotland,
>>> with registration number SC005336.
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
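
For anyone comparing tools on this thread: the symptom described above (every instance of one class predicted as a single other class) can be checked from the predictions alone, without reading the confusion matrix by eye. Below is a minimal sketch in Python. It is not part of Mahout, Mallet or rainbow; the function names, the 0.9 threshold and the toy labels are all made up for illustration, and the labels are assumed to be plain strings.

```python
from collections import Counter, defaultdict


def confusion_matrix(actual, predicted):
    """Count (actual label -> predicted label) pairs."""
    matrix = defaultdict(Counter)
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances that sit on the diagonal of the matrix."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[label] for label, row in matrix.items())
    return correct / total if total else 0.0


def systematic_confusions(matrix, threshold=0.9):
    """Return (actual, predicted, share) for every class whose instances
    are almost all predicted as one *other* class.

    The threshold is an arbitrary illustrative choice.
    """
    flagged = []
    for label, row in matrix.items():
        n = sum(row.values())
        top_label, top_count = row.most_common(1)[0]
        if top_label != label and top_count / n >= threshold:
            flagged.append((label, top_label, top_count / n))
    return flagged


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not the real results).
    actual = ["rec.motorcycles"] * 4 + ["talk.politics.mideast"] * 4
    predicted = (["talk.politics.mideast"] * 4
                 + ["talk.politics.mideast"] * 3 + ["rec.motorcycles"])
    m = confusion_matrix(actual, predicted)
    print("accuracy:", accuracy(m))
    for a, p, share in systematic_confusions(m):
        print(f"{a} -> {p} ({share:.0%} of instances)")
```

If a class is flagged on one toolkit but not on another with the same split, that points at the toolkit or the way the data was fed to it rather than at the data itself.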
