Hi Robin,

I found out that I had a problem with my Mahout setup. While reinstalling from svn and applying the patches, I noticed that MAHOUT-9 adds files under "core/src" while MAHOUT-60 adds files under "src".
On my setup, the ant script "build.xml" lives in "core", so it completely ignores anything added by MAHOUT-60: the target "compile-examples" uses a source path of "[workspace]/core/src/main/examples/bayes", while the files added by MAHOUT-60 are in "[workspace]/src/main/examples/bayes".

After sorting this out, I was able to make CBayes work, with an accuracy of over 90% on the split I provided earlier. This is impressive! I wonder why I am getting a higher score than the one you posted here.

However, I am still having trouble with the Enron corpus: everything is assigned to one of the two classes with the highest weight normalization, "1_1" and "1_4" (I might be totally wrong with this assumption; "1_1" and "1_4" might just have been selected by chance...).

Here is a link to a split I made out of the UC Berkeley annotated Enron corpus. The emails are edited so that they no longer contain the headers, which gave me a small accuracy gain when testing with Mallet. The archive also includes logs from my CBayes tests on both splits.

http://www.2shared.com/file/3671623/69773258/TrainAndTesttar.html

thanks,
Philippe.

On Sun, Jul 27, 2008 at 7:40 PM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am glad to see that you were able to make it work, I will
> try it as soon as possible. Probably something went wrong while
> downloading/applying/updating Mahout-60.
>
> I am using the UC Berkeley annotated subset that you can find from
> your link, here:
> http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
> from here http://bailando.sims.berkeley.edu/enron_email.html.
>
> It's a multiple level label, each message can have a:
> Coarse genre,
> Included/forwarded information,
> Primary topics,
> Emotional tone (if not neutral)
>
> There is a .cats file associated with each message.
>
> I made a little utility that lets you pick a label type, parse the cats
> file and output the message in the appropriately labeled folder. Also, it's
> easy to just use the 1 to 8 subfolders in the tar, these folders are
> labeled by coarse genre. I can share this little app, if you want.
>
> I am very curious to see if I will be able to make it work.
>
> Thanks for the help,
> Philippe
>
>
> On Sun, Jul 27, 2008 at 11:29 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
>> Also could you tell me which version of the Enron email corpus you are using
>> for classification. Please provide the link. I found tons of variations
>> online. What classification labels are you using (Email User Name?).
>> http://sgi.nu/enron/corpora.php
>>
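
For reference, the label-sorting step described in the quoted message (pick a label type, parse the .cats file, write the message into a labeled folder) could look roughly like the sketch below. This is not the original utility. It assumes that each .cats line has the form "n1,n2,n3" (top-level category, second-level category, frequency), that every message "N.txt" sits next to its "N.cats" file, and that folder names follow the "1_4" style seen in the results above. The function names are made up for illustration.

```python
import shutil
from pathlib import Path


def parse_cats(text):
    """Parse .cats content into (top_level, second_level, frequency) tuples.

    Assumes one "n1,n2,n3" triple per line; anything else is skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue  # blank or malformed line
        rows.append(tuple(int(p) for p in parts))
    return rows


def pick_label(rows, label_type):
    """Return a label such as "1_4" for the requested top-level type.

    When a message has several categories of that type, the one with the
    highest frequency wins (first one on ties). Returns None if the message
    has no category of that type.
    """
    candidates = [r for r in rows if r[0] == label_type]
    if not candidates:
        return None
    top, second, _ = max(candidates, key=lambda r: r[2])
    return f"{top}_{second}"


def sort_messages(src_dir, dst_dir, label_type):
    """Copy each labeled message into dst_dir/<label>/ and count per label."""
    counts = {}
    for cats_path in sorted(Path(src_dir).rglob("*.cats")):
        msg_path = cats_path.with_suffix(".txt")
        if not msg_path.exists():
            continue  # annotation without a message file
        label = pick_label(parse_cats(cats_path.read_text()), label_type)
        if label is None:
            continue  # no category of the requested type
        target = Path(dst_dir) / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(msg_path, target / msg_path.name)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling sort_messages("enron_with_categories", "by_genre", 1) would then produce one folder per coarse genre label, and the returned counts make it easy to check how unbalanced the classes are before training.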
