It is probably good to benchmark against standard datasets. For text classification this tends to be the Reuters set:
http://www.daviddlewis.com/resources/testcollections/

This way you know if you are doing a good job.

Miles

2009/7/22 Grant Ingersoll <[email protected]>

> The model size is much smaller with unigrams. :-)
>
> I'm not quite sure what constitutes good just yet, but I can report the
> following using the commands I reported earlier, with the exception that
> I am using unigrams.
>
> I have two categories: History and Science.
>
> 0. Splitter:
>
>    org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
>      --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml
>      --outputDir /PATH/wikipedia/chunks -c 64
>
> Then prep:
>
>    org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
>      --input PATH/wikipedia/test-chunks/
>      --output PATH/wikipedia/subjects/test
>      --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
>
> (also do this for the training set)
>
> 1. Train set:
>
>    ls ../chunks
>    chunk-0001.xml  chunk-0002.xml  chunk-0003.xml  chunk-0004.xml
>    chunk-0005.xml  chunk-0006.xml  chunk-0007.xml  chunk-0008.xml
>    chunk-0009.xml  chunk-0010.xml  chunk-0011.xml  chunk-0012.xml
>    chunk-0013.xml  chunk-0014.xml  chunk-0015.xml  chunk-0016.xml
>    chunk-0017.xml  chunk-0018.xml  chunk-0019.xml  chunk-0020.xml
>    chunk-0021.xml  chunk-0022.xml  chunk-0023.xml  chunk-0024.xml
>    chunk-0025.xml  chunk-0026.xml  chunk-0027.xml  chunk-0028.xml
>    chunk-0029.xml  chunk-0030.xml  chunk-0031.xml  chunk-0032.xml
>    chunk-0033.xml  chunk-0034.xml  chunk-0035.xml  chunk-0036.xml
>    chunk-0037.xml  chunk-0038.xml  chunk-0039.xml
>
> 2. Test set:
>
>    ls
>    chunk-0101.xml  chunk-0102.xml  chunk-0103.xml  chunk-0104.xml
>    chunk-0105.xml  chunk-0107.xml  chunk-0108.xml  chunk-0109.xml
>    chunk-0130.xml  chunk-0131.xml  chunk-0132.xml  chunk-0133.xml
>    chunk-0134.xml  chunk-0135.xml  chunk-0137.xml  chunk-0139.xml
>
> 3. Run the Trainer on the train set:
>
>      --input PATH/wikipedia/subjects/out
>      --output PATH/wikipedia/subjects/model
>      --gramSize 1 --classifierType bayes
>
> 4. Run the TestClassifier:
>
>      --model PATH/wikipedia/subjects/model
>      --testDir PATH/wikipedia/subjects/test
>      --gramSize 1 --classifierType bayes
>
> Output is:
>
> <snip>
> 9/07/22 15:55:09 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances   :  4143   74.0615%
> Incorrectly Classified Instances :  1451   25.9385%
> Total Classified Instances       :  5594
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a       b       <-- Classified as
> 3910    186     |  4096  a = history
> 1265    233     |  1498  b = science
> Default Category: unknown: 2
> </snip>
>
> At least it's better than 50%, which is presumably a good thing ;-) I
> have no clue what the state of the art is these days, but it doesn't
> seem _horrendous_ either.
>
> I'd love to see someone validate what I have done. Let me know if you
> need more details. I'd also like to know how I can improve it.
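A useful way to read the confusion matrix above is per-class precision and recall rather than overall accuracy: history recall is 3910/4096 (about 95%), but science recall is only 233/1498 (about 16%), so the 74% figure largely reflects the bigger history class. A minimal sketch of that arithmetic in plain Java (class and variable names are illustrative, not Mahout code):

    // Per-class precision/recall from the confusion matrix reported above.
    // Rows are actual classes, columns are predicted classes.
    public class ConfusionStats {
        public static void main(String[] args) {
            long[][] m = { {3910, 186},    // actual history
                           {1265, 233} };  // actual science
            String[] labels = {"history", "science"};
            for (int c = 0; c < labels.length; c++) {
                long tp = m[c][c];                  // correctly classified as c
                long actual = m[c][0] + m[c][1];    // row sum: true instances of c
                long predicted = m[0][c] + m[1][c]; // column sum: predictions of c
                System.out.printf("%s: precision=%.3f recall=%.3f%n",
                                  labels[c],
                                  (double) tp / predicted, (double) tp / actual);
            }
        }
    }

This prints precision=0.756, recall=0.955 for history and precision=0.556, recall=0.156 for science, which suggests one answer to "how I can improve it": deal with the skew toward the history class.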
> On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
>
>> Indeed. I hadn't snapped to the fact you were using trigrams.
>>
>> 30 million features is quite plausible for that. To effectively use
>> long n-grams as features in classification of documents, you really
>> need to have the following:
>>
>> a) good statistical methods for resolving what is useful and what is
>> not. Everybody here knows that my preference for a first hack is
>> sparsification with log-likelihood ratios.
>>
>> b) some kind of smoothing using smaller n-grams
>>
>> c) some kind of smoothing over variants of n-grams
>>
>> AFAIK, Mahout doesn't have many, if any, of these in place. You are
>> likely to do better with unigrams as a result.
>>
>> On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll
>> <[email protected]> wrote:
>>
>>> I suspect the explosion in the number of features, Ted, is due to the
>>> use of n-grams producing a lot of unique terms. I can try w/
>>> gramSize = 1; that will likely reduce the feature set quite a bit.
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
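Ted's point (a) refers to Dunning's log-likelihood ratio (G^2) test: score each n-gram by how unevenly it is distributed across the categories, and keep only the high scorers. A self-contained sketch of the standard entropy-based computation (the 2x2 count arguments are the usual contingency-table cells; this is not claimed to be the Mahout API of the time):

    // G^2 log-likelihood ratio for a 2x2 contingency table, after Dunning.
    // For sparsification: k11 = count of the n-gram in class A, k12 = its
    // count in class B, k21/k22 = counts of all other n-grams in A and B.
    // Keep only n-grams whose score clears a chosen threshold.
    public final class Llr {

        // Unnormalized entropy: sum*log(sum) minus the sum of x*log(x).
        private static double entropy(long... counts) {
            long sum = 0;
            double xLogX = 0.0;
            for (long x : counts) {
                sum += x;
                xLogX += (x == 0) ? 0.0 : x * Math.log(x);
            }
            return ((sum == 0) ? 0.0 : sum * Math.log(sum)) - xLogX;
        }

        public static double logLikelihoodRatio(long k11, long k12,
                                                long k21, long k22) {
            double rowEntropy = entropy(k11 + k12, k21 + k22);
            double columnEntropy = entropy(k11 + k21, k12 + k22);
            double matrixEntropy = entropy(k11, k12, k21, k22);
            return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
        }
    }

Applied to the two-class setup above, every trigram gets a score from its history/science counts, and pruning the low scorers should cut the 30-million-feature trigram space down to something far more tractable.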

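For point (b), one common concrete realization is Jelinek-Mercer style linear interpolation: estimate the trigram probability as a weighted mix of trigram, bigram, and unigram relative frequencies, so rare trigrams inherit probability mass from their smaller constituents. A sketch under assumed data structures (the single count map and the fixed weights are illustrative choices, not anything in Mahout); point (c) could then be approximated by normalizing case or stemming words before counting:

    import java.util.Map;

    // Interpolated (Jelinek-Mercer) smoothing of a trigram probability.
    // Keys in `counts` are space-joined n-grams of any order; the weights
    // W3 + W2 + W1 must sum to 1 and would normally be tuned on held-out data.
    public final class BackoffSmoother {
        private static final double W3 = 0.6, W2 = 0.3, W1 = 0.1;

        private final Map<String, Long> counts;
        private final long totalUnigrams;

        public BackoffSmoother(Map<String, Long> counts, long totalUnigrams) {
            this.counts = counts;
            this.totalUnigrams = totalUnigrams;
        }

        // Relative frequency count(ngram) / count(history); an empty history
        // means a unigram, normalized by the total corpus token count.
        private double cond(String ngram, String history) {
            long num = counts.getOrDefault(ngram, 0L);
            long den = history.isEmpty() ? totalUnigrams
                                         : counts.getOrDefault(history, 0L);
            return (den == 0) ? 0.0 : (double) num / den;
        }

        // P(w3 | w1 w2) as a mix of trigram, bigram, and unigram estimates.
        public double prob(String w1, String w2, String w3) {
            return W3 * cond(w1 + ' ' + w2 + ' ' + w3, w1 + ' ' + w2)
                 + W2 * cond(w2 + ' ' + w3, w2)
                 + W1 * cond(w3, "");
        }
    }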