This is the class imbalance problem (i.e. you have many more training instances for one class than for the other).
In this case you could ensure that the training set is balanced (50:50); more interestingly, you can use a class prior which corrects for this. Or you could over-sample or even under-sample the training set, and so on (a minimal sketch of under-sampling follows after the quoted thread).

Miles

2009/7/22 Grant Ingersoll <[email protected]>

> <done_basking>Grant</done_basking>
>
> Here's an interesting piece:
>
> 09/07/22 18:23:02 INFO bayes.TestClassifier: Testing:wikipedia/subjects/prepared-test/history.txt
> 09/07/22 18:23:07 INFO bayes.TestClassifier: history 95.458984375 3910/4096.0
> 09/07/22 18:23:07 INFO bayes.TestClassifier: --------------
> 09/07/22 18:23:07 INFO bayes.TestClassifier: Testing:/wikipedia/subjects/prepared-test/science.txt
> 09/07/22 18:23:08 INFO bayes.TestClassifier: science 15.554072096128172 233/1498.0
> 09/07/22 18:23:08 INFO bayes.TestClassifier: =======================================================
>
> In other words, I'm really good at predicting History as a category and really bad at predicting Science.
>
> I think the following might help explain why:
>
> ls -l
> total 245360
> -rwxrwxrwx  1 grantingersoll  staff  89518235 Jul 22 17:53 history.txt*
> -rwxrwxrwx  1 grantingersoll  staff  36099183 Jul 22 17:53 science.txt*
>
> The number of history examples is almost double the number of science, based on my test set.
>
> There is obviously a teaching moment here. I know there is a lot out there about sample sizes, feature selection, etc.; can we boil some of these down into some cogent recommendations for our users?
>
> -Grant
>
> On Jul 22, 2009, at 5:23 PM, Grant Ingersoll wrote:
>
>> <basking>Grant</basking>
>>
>> On Jul 22, 2009, at 4:46 PM, Ted Dunning wrote:
>>
>>> Getting something to run is a big step. It is important to bask in the glow for a tiny moment.
>>>
>>> On Wed, Jul 22, 2009 at 1:05 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> Confusion Matrix
>>>> -------------------------------------------------------
>>>> a      b      <--Classified as
>>>> 3910   186    |  4096   a = history
>>>> 1265   233    |  1498   b = science
>>>> Default Category: unknown: 2
>>>> </snip>
>>>>
>>>> At least it's better than 50%, which is presumably a good thing ;-)  I have no clue what the state of the art is these days, but it doesn't seem _horrendous_ either.
>>>>
>>>> I'd love to see someone validate what I have done. Let me know if you need more details. I'd also like to know how I can improve it.
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
> http://www.lucidimagination.com/search
>

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
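[Editorial sketch, not from the thread.] To make the under-sampling suggestion above concrete, here is a minimal plain-Java sketch that cuts every class down to the size of the rarest one before training. The map-of-lists representation, the `UnderSampler` class, and the generic `D` document type are assumptions for illustration; this is not part of the Mahout API.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class UnderSampler {

      /**
       * Balance a training set by randomly under-sampling every class down to
       * the size of the smallest class.  Input is a map from class label
       * (e.g. "history", "science") to that class's training documents.
       */
      public static <D> Map<String, List<D>> underSample(Map<String, List<D>> byClass, long seed) {
        // Size of the rarest class; every other class is cut down to this.
        int minSize = Integer.MAX_VALUE;
        for (List<D> docs : byClass.values()) {
          minSize = Math.min(minSize, docs.size());
        }

        Random rnd = new Random(seed);
        Map<String, List<D>> balanced = new HashMap<String, List<D>>();
        for (Map.Entry<String, List<D>> e : byClass.entrySet()) {
          List<D> docs = new ArrayList<D>(e.getValue());
          Collections.shuffle(docs, rnd);  // take a random subset, not the first N on disk
          balanced.put(e.getKey(), new ArrayList<D>(docs.subList(0, minSize)));
        }
        return balanced;
      }
    }

With a roughly 2:1 history:science split like the one in the thread, this would discard about half of the history documents; over-sampling is the mirror image (duplicate or resample the science documents) and keeps all of the history data at the cost of repeated examples.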

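[Also editorial, not from the thread.] For readers unsure how the per-class percentages in the TestClassifier log relate to the confusion matrix, the following small sketch recomputes them from the matrix quoted above. The numbers and class names come from Grant's output; the `ConfusionStats` helper itself is hypothetical.

    public class ConfusionStats {

      public static void main(String[] args) {
        // Confusion matrix from the thread:
        //                    classified as: a (history)   b (science)
        // actual history (a):                    3910           186
        // actual science (b):                    1265           233
        long[][] m = { {3910, 186},
                       {1265, 233} };
        String[] labels = {"history", "science"};

        for (int i = 0; i < labels.length; i++) {
          long rowTotal = 0;  // everything that is actually class i
          long colTotal = 0;  // everything that was classified as class i
          for (int j = 0; j < labels.length; j++) {
            rowTotal += m[i][j];
            colTotal += m[j][i];
          }
          double recall = 100.0 * m[i][i] / rowTotal;     // the percentage reported in the log
          double precision = 100.0 * m[i][i] / colTotal;
          System.out.printf("%-8s recall %.2f%%  precision %.2f%%%n", labels[i], recall, precision);
        }
        // Prints roughly: history recall 95.46%, science recall 15.55% -- the same figures
        // as the log, showing how a ~74% overall accuracy can hide a weak minority class.
      }
    }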