That's the only difference between the two.

On Thu, Oct 8, 2009 at 7:08 PM, Sandra Clover <[email protected]> wrote:
> Thanks for the tip Robin - I was wondering what the difference between
> the 2 was, but was unable to find anything on them. On this topic, is
> there anything else I should be aware of between the 2 models?
>
> Bayes algorithm: good for ??
> CBayes algorithm: good for multiclass classification (categories > 2)
>
> ----- Original Message -----
> From: "Robin Anil"
> To: [email protected]
> Subject: Re: Document size rules of thumb
> Date: Thu, 8 Oct 2009 13:39:20 +0530
>
> One more tip: you will get better results with the cbayes algorithm
> than with the bayes algorithm for multiclass classification
> (categories > 2).
>
> On Thu, Oct 8, 2009 at 1:37 PM, Robin Anil wrote:
>
> > On Thu, Oct 8, 2009 at 1:33 PM, Sandra Clover wrote:
> >
> >> Hi Ted, Thanks for the response. To answer your questions:
> >>
> >> 1. I have 576 categories.
> >> 2. I started with 5 training documents per category. Went up to 10
> >>    but error levels remained the same. Am going to go up to 30
> >>    documents and am going to increase the length of the documents.
> >>
> >> How did you derive the 50 words of training data for some topics?
> >> Curious... S.
> >
> > 30 documents is too few if words overlap across categories and you
> > don't have enough discriminative words for each category.
> >
> > Again, with 576 categories you need really good discriminative words
> > in each category to be able to cover all the unknown documents you
> > wish to classify.
> >
> >> ----- Original Message -----
> >> From: "Ted Dunning"
> >> To: [email protected]
> >> Subject: Re: Document size rules of thumb
> >> Date: Wed, 7 Oct 2009 10:21:20 -0700
> >>
> >> Sandra,
> >>
> >> This is a classic case of over-fitting. I suspect training data
> >> inadequacy. One thing you don't say is how many categories you have
> >> and how many training documents per category you have. Your point
> >> (2) might indicate that you have as little as 50 words of training
> >> data for some topics. That would make it difficult for even the
> >> best classifiers to get a sharp result.
> >>
> >> I would recommend the following:
> >>
> >> a) get more training data (always a good thing even if often
> >>    infeasible)
> >>
> >> b) try a few other algorithms. I would recommend trying Luduan
> >>    (from my dissertation, pdf sent to you in a separate email),
> >>    confidence weighted learning (see
> >>    http://www.cs.jhu.edu/~mdredze/publications/, especially
> >>    http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf) and
> >>    vowpal (http://hunch.net/~vw/)
> >>
> >> c) post your data for others to try
> >>
> >> Hope this helps.
> >>
> >> On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:
> >>
> >> > 0. The setup is Mahout 0.1 & Hadoop 0.19.2 - I think I am using a
> >> > branch version. Currently trying to install the trunk version.
> >> >
> >> > 1. The data I am trying to classify is from scientific papers -
> >> > essentially the abstract title, text and keywords of their paper -
> >> > example below.
> >> >
> >> > 2. No data source is under 300 characters.
> >> >
> >> > 3. I am training using the Mahout naive Bayes and am getting a low
> >> > incorrectly classified rate, something like 1.67% - I'm quite
> >> > happy with that...
> >> >
> >> > 4. After I have trained the model Robin I use the Mahout naive
> >> > Bayes classify() method to classify new (unseen) data (with the
> >> > classification already known) - this is where I start to get
> >> > problems - I get very poor successful classification rates for
> >> > new data. Something like: 82% unsuccessfully classified.
> >> >
> >> > To summarise: I get very good results in training and very poor
> >> > results with new data.
> >>
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
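
For anyone following the thread, here is a rough sketch of the two ideas discussed above: comparing the error on the training documents with the error on held-out documents (the over-fitting check Ted describes), and scoring with the complement of each class rather than the class itself (the general idea behind cbayes). This is plain Python, not Mahout code; the function names train, classify and error_rate, the smoothing value alpha and the toy documents are all made up for illustration, and Mahout's actual cbayes implementation may differ in details such as weight normalisation.

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """docs is a list of (label, tokens) pairs. Returns, per label,
    (log prior, own-class log weights, complement-class log weights)."""
    counts = defaultdict(Counter)              # label -> word counts
    for label, tokens in docs:
        counts[label].update(tokens)
    total = Counter()                          # word counts over all labels
    for c in counts.values():
        total.update(c)
    vocab = set(total)
    n_all = sum(total.values())
    n_docs = Counter(label for label, _ in docs)
    model = {}
    for label, c in counts.items():
        n_own = sum(c.values())
        n_rest = n_all - n_own
        denom_own = n_own + alpha * len(vocab)
        denom_rest = n_rest + alpha * len(vocab)
        # Laplace-smoothed log probability of each word in this class
        own = {w: math.log((c[w] + alpha) / denom_own) for w in vocab}
        # same, but estimated from every class EXCEPT this one
        comp = {w: math.log((total[w] - c[w] + alpha) / denom_rest)
                for w in vocab}
        model[label] = (math.log(n_docs[label] / len(docs)), own, comp)
    return model


def classify(model, tokens, complement=False):
    """Return the best label; words never seen in training are ignored."""
    def score(label):
        log_prior, own, comp = model[label]
        if complement:
            # pick the class whose complement explains the document worst
            return -sum(comp[w] for w in tokens if w in comp)
        return log_prior + sum(own[w] for w in tokens if w in own)
    return max(model, key=score)


def error_rate(model, docs, complement=False):
    """Fraction of (label, tokens) pairs that are misclassified."""
    wrong = sum(1 for label, tokens in docs
                if classify(model, tokens, complement) != label)
    return wrong / len(docs)


# Toy data, invented for this sketch only.
train_docs = [
    ("physics", ["quantum", "energy", "particle"]),
    ("physics", ["energy", "field"]),
    ("biology", ["cell", "protein", "gene"]),
    ("biology", ["gene", "cell"]),
    ("chem", ["reaction", "acid", "molecule"]),
]
held_out = [
    ("physics", ["particle", "field"]),
    ("biology", ["protein", "gene"]),
]

model = train(train_docs)
print("training error:", error_rate(model, train_docs))
print("held-out error (bayes-style):", error_rate(model, held_out))
print("held-out error (complement-style):",
      error_rate(model, held_out, complement=True))
```

A large gap between the first number and the other two on real data is the over-fitting symptom described in the thread. The held-out documents must not be used in training, otherwise the comparison says nothing about unseen data.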
