Thanks for the tip, Robin. I was wondering what the difference between the two was, but was unable to find anything on them. On this topic, is there anything else I should be aware of between the two models?

Bayes algorithm: good for ??
CBayes algorithm: good for multiclass classification (categories > 2)

----- Original Message -----
From: "Robin Anil"
To: [email protected]
Subject: Re: Document size rules of thumb
Date: Thu, 8 Oct 2009 13:39:20 +0530

One more tip: you will get better results with the cbayes algorithm than with the bayes algorithm for multiclass classification (categories > 2).

On Thu, Oct 8, 2009 at 1:37 PM, Robin Anil wrote:

> On Thu, Oct 8, 2009 at 1:33 PM, Sandra Clover wrote:
>
>> Hi Ted,
>>
>> Thanks for the response. To answer your questions:
>>
>> 1. I have 576 categories.
>> 2. I started with 5 training documents per category. Went up to 10
>>    but error levels remained the same. I am going to go up to 30
>>    documents and am going to increase the length of the documents.
>>
>> How did you derive the 50 words of training data for some topics?
>> Curious...
>>
>> S.
>
> 30 documents is too few if words overlap across categories and you
> don't have enough discriminative words for each category.
>
> Again, with 576 categories you need really good discriminative words
> in each category to be able to cover all the unknown documents you
> wish to classify.
>
>> ----- Original Message -----
>> From: "Ted Dunning"
>> To: [email protected]
>> Subject: Re: Document size rules of thumb
>> Date: Wed, 7 Oct 2009 10:21:20 -0700
>>
>> Sandra,
>>
>> This is a classic case of over-fitting. I suspect training data
>> inadequacy. One thing you don't say is how many categories you have
>> and how many training documents per category you have. Your point
>> (2) might indicate that you have as little as 50 words of training
>> data for some topics. That would make it difficult for even the best
>> classifiers to get a sharp result.
>>
>> I would recommend the following:
>>
>> a) get more training data (always a good thing even if often
>>    infeasible)
>>
>> b) try a few other algorithms. I would recommend trying Luduan (from
>>    my dissertation, pdf sent to you in a separate email), confidence
>>    weighted learning (see http://www.cs.jhu.edu/~mdredze/publications/,
>>    especially http://www.aclweb.org/anthology-new/D/D09/D09-1052.pdf)
>>    and vowpal (http://hunch.net/~vw/)
>>
>> c) post your data for others to try
>>
>> Hope this helps.
>>
>> On Wed, Oct 7, 2009 at 9:37 AM, Sandra Clover wrote:
>>
>> > 0. The setup is Mahout 0.1 & Hadoop 0.19.2 – I think I am using a
>> >    branch version. Currently trying to install the trunk version.
>> >
>> > 1. The data I am trying to classify is from scientific papers,
>> >    essentially the abstract title, text and keywords of the paper -
>> >    example below.
>> >
>> > 2. No data source is under 300 characters.
>> >
>> > 3. I am training using the Mahout naive Bayes and am getting low
>> >    incorrectly classified rates, something like 1.67% - I’m quite
>> >    happy with that…
>> >
>> > 4. After I have trained the model Robin I use the Mahout naive
>> >    Bayes classify() method to classify new (unseen) data (with the
>> >    classification already known) - this is where I start to get
>> >    problems - I get very poor successful classification rates for
>> >    new data. Something like 82% unsuccessfully classified.
>> >
>> > To summarise: I get very good results in training and very poor
>> > results with new data.
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>> --
>> Be Yourself @ mail.com!
>> Choose From 200+ Email Addresses
>> Get a Free Account at www.mail.com!
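
On the bayes vs. cbayes question raised in this thread, here is a minimal sketch of how the two scoring rules differ. It assumes that cbayes follows the general complement naive Bayes idea (score each class using word counts from all the other classes, and pick the class whose complement fits the document worst). This is plain Python, not Mahout code: the function names, the toy documents, the uniform class prior and the smoothing value alpha=1.0 are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical): (category, tokens) pairs.
DOCS = [
    ("physics",   ["quantum", "energy", "particle", "energy"]),
    ("biology",   ["cell", "protein", "gene", "cell"]),
    ("chemistry", ["molecule", "reaction", "energy", "bond"]),
]

def train(docs):
    """Collect per-category word counts and the overall vocabulary."""
    counts = defaultdict(Counter)
    for label, tokens in docs:
        counts[label].update(tokens)
    vocab = {w for wc in counts.values() for w in wc}
    return counts, vocab

def log_weights(word_counts, vocab, alpha=1.0):
    """Smoothed log word probabilities for one bag of word counts."""
    total = sum(word_counts[w] for w in vocab) + alpha * len(vocab)
    return {w: math.log((word_counts[w] + alpha) / total) for w in vocab}

def score(tokens, weights):
    """Sum of log weights; words outside the vocabulary are ignored."""
    return sum(weights[t] for t in tokens if t in weights)

def classify_bayes(tokens, counts, vocab):
    """Standard rule: the class whose OWN words fit best wins."""
    return max(counts,
               key=lambda c: score(tokens, log_weights(counts[c], vocab)))

def classify_cbayes(tokens, counts, vocab):
    """Complement rule: the class whose COMPLEMENT fits worst wins."""
    def complement_score(label):
        comp = Counter()
        for other, wc in counts.items():
            if other != label:
                comp.update(wc)  # pool counts from every other class
        return score(tokens, log_weights(comp, vocab))
    return min(counts, key=complement_score)

counts, vocab = train(DOCS)
unseen = ["cell", "protein"]
print(classify_bayes(unseen, counts, vocab))   # biology
print(classify_cbayes(unseen, counts, vocab))  # biology
```

With many categories and only a handful of documents each, the complement estimate for a class is built from far more text than the class's own estimate, which is one commonly given reason the complement variant can hold up better in that setting. Whether it helps on a given data set still has to be checked on documents that were not used for training.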
