It's also reasonable to hand-judge 20 docs from each of the four cells of the confusion matrix. That will give you a rough idea of what the error processes are.
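A minimal sketch of one way to draw that sample. It assumes a hypothetical tab-separated file of docId, actual label, predicted label (one row per classified document) that you would have to dump yourself; as far as I know none of the Mahout drivers below emit such a file.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class ConfusionCellSampler {
  // Usage: java ConfusionCellSampler classifications.tsv
  public static void main(String[] args) throws IOException {
    // Bucket document ids by confusion-matrix cell (actual -> predicted).
    Map<String, List<String>> cells = new HashMap<>();
    for (String line : Files.readAllLines(Paths.get(args[0]))) {
      String[] f = line.split("\t"); // docId, actual label, predicted label
      cells.computeIfAbsent(f[1] + " -> " + f[2], k -> new ArrayList<>()).add(f[0]);
    }
    // Draw a fixed-seed random sample of up to 20 docs from each cell,
    // so the sample is reproducible across runs.
    Random rnd = new Random(42);
    for (Map.Entry<String, List<String>> cell : cells.entrySet()) {
      List<String> docs = cell.getValue();
      Collections.shuffle(docs, rnd);
      System.out.println(cell.getKey() + ": "
          + docs.subList(0, Math.min(20, docs.size())));
    }
  }
}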
On Wed, Jul 22, 2009 at 1:13 PM, Miles Osborne <[email protected]> wrote:

> it is probably good to benchmark against standard datasets. for text
> classification this tends to be the Reuters set:
>
> http://www.daviddlewis.com/resources/testcollections/
>
> this way you know if you are doing a good job
>
> Miles
>
> 2009/7/22 Grant Ingersoll <[email protected]>
>
> > The model size is much smaller with unigrams. :-)
> >
> > I'm not quite sure what constitutes good just yet, but I can report
> > the following using the commands I reported earlier, w/ the exception
> > that I am using unigrams:
> >
> > I have two categories: History and Science
> >
> > 0. Splitter:
> > org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
> >   --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml
> >   --outputDir /PATH/wikipedia/chunks -c 64
> >
> > Then prep:
> > org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
> >   --input PATH/wikipedia/test-chunks/
> >   --output PATH/wikipedia/subjects/test
> >   --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
> > (also do this for the training set)
> >
> > 1. Train set:
> > ls ../chunks
> > chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml
> > chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml
> > chunk-0033.xml  chunk-0037.xml
> > chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml
> > chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml
> > chunk-0034.xml  chunk-0038.xml
> > chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml
> > chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml
> > chunk-0035.xml  chunk-0039.xml
> > chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml
> > chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml
> > chunk-0036.xml
> >
> > 2. Test set:
> > ls
> > chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml
> > chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
> > chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml
> > chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml
> >
> > 3. Run the Trainer on the train set:
> >   --input PATH/wikipedia/subjects/out
> >   --output PATH/wikipedia/subjects/model
> >   --gramSize 1 --classifierType bayes
> >
> > 4. Run the TestClassifier:
> >   --model PATH/wikipedia/subjects/model
> >   --testDir PATH/wikipedia/subjects/test
> >   --gramSize 1 --classifierType bayes
> >
> > Output is:
> >
> > <snip>
> > 09/07/22 15:55:09 INFO bayes.TestClassifier:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances   :  4143    74.0615%
> > Incorrectly Classified Instances :  1451    25.9385%
> > Total Classified Instances       :  5594
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a     b     <--Classified as
> > 3910  186   |  4096  a = history
> > 1265  233   |  1498  b = science
> > Default Category: unknown: 2
> > </snip>
> >
> > At least it's better than 50%, which is presumably a good thing ;-)
> > I have no clue what the state of the art is these days, but it
> > doesn't seem _horrendous_ either.
> >
> > I'd love to see someone validate what I have done. Let me know if you
> > need more details. I'd also like to know how I can improve it.
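A side note on reading that output: the headline 74% masks a strong skew toward the larger class. The sketch below hard-codes the counts from the confusion matrix above (rows are the actual labels, columns the predicted ones) and derives per-class precision and recall; it shows recall of about 0.95 on history but only about 0.16 on science, so most science articles are being misfiled as history.

public class ConfusionStats {
  public static void main(String[] args) {
    // Counts copied from the TestClassifier output above.
    String[] labels = {"history", "science"};
    long[][] m = {
        {3910,  186},   // actual history: 3910 -> history, 186 -> science
        {1265,  233},   // actual science: 1265 -> history, 233 -> science
    };
    long total = 0, correct = 0;
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < m.length; j++) {
        total += m[i][j];
        if (i == j) correct += m[i][j];
      }
    }
    System.out.printf("accuracy = %.4f%n", (double) correct / total); // 0.7406
    for (int c = 0; c < labels.length; c++) {
      long tp = m[c][c];
      long actual = 0, predicted = 0;
      for (int k = 0; k < labels.length; k++) {
        actual += m[c][k];     // row sum: docs whose true label is c
        predicted += m[k][c];  // column sum: docs predicted as c
      }
      System.out.printf("%-7s precision = %.3f  recall = %.3f%n",
          labels[c], (double) tp / predicted, (double) tp / actual);
    }
  }
}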
> > On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
> >
> > > Indeed. I hadn't snapped to the fact that you were using trigrams.
> > >
> > > 30 million features is quite plausible for that. To use long n-grams
> > > effectively as features in document classification, you really need
> > > the following:
> > >
> > > a) good statistical methods for resolving what is useful and what is
> > > not. Everybody here knows that my preference for a first hack is
> > > sparsification with log-likelihood ratios.
> > >
> > > b) some kind of smoothing using smaller n-grams
> > >
> > > c) some kind of smoothing over variants of n-grams
> > >
> > > AFAIK, Mahout doesn't have many (or any) of these in place. You are
> > > likely to do better with unigrams as a result.
> > >
> > > On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll
> > > <[email protected]> wrote:
> > >
> > > > I suspect the explosion in the number of features, Ted, is due to
> > > > the use of n-grams producing a lot of unique terms. I can try w/
> > > > gramSize = 1; that will likely reduce the feature set quite a bit.
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

--
Ted Dunning, CTO
DeepDyve
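For anyone who wants to try point (a): the test Ted refers to is the G^2 log-likelihood ratio from his 1993 paper, "Accurate Methods for the Statistics of Surprise and Coincidence". Below is a minimal, self-contained sketch (not Mahout's implementation; the counts in main() are invented for illustration). The idea is to score each n-gram against each class via a 2x2 contingency table and keep only n-grams whose score clears some threshold, which sparsifies the feature set.

public class LlrSparsifier {
  // x * ln(x), with the convention 0 * ln(0) = 0.
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy of a list of counts:
  // H = N * ln(N) - sum_i x_i * ln(x_i), where N = sum_i x_i.
  private static double entropy(long... counts) {
    long sum = 0;
    double sumXLogX = 0.0;
    for (long x : counts) {
      sum += x;
      sumXLogX += xLogX(x);
    }
    return xLogX(sum) - sumXLogX;
  }

  // G^2 = 2 * (H(row sums) + H(column sums) - H(cells)) for the 2x2 table:
  //   k11 = n-gram in class,        k12 = n-gram outside class,
  //   k21 = other tokens in class,  k22 = other tokens outside class.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negative values caused by floating-point rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
  }

  public static void main(String[] args) {
    // Made-up counts: an n-gram seen 110 times in 10,000 "science" tokens
    // but only 30 times in 40,000 "history" tokens. A large score (well
    // above the ~10.8 cutoff for p < 0.001 at one degree of freedom)
    // marks the n-gram as worth keeping as a feature.
    System.out.printf("LLR = %.2f%n", logLikelihoodRatio(110, 30, 9890, 39970));
  }
}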

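Point (b) is deliberately open-ended; one common concrete reading is back-off/interpolation, where a long n-gram's probability estimate is mixed with the estimate for its shorter suffix, so rare or unseen long n-grams still get sensible mass. A minimal sketch under that assumption; the counts, vocabulary, and lambda below are all invented for illustration and this is not anything Mahout does.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NgramBackoff {
  // Illustrative per-class n-gram counts; a real model would be built
  // from the training corpus for each category.
  private final Map<List<String>, Long> counts = new HashMap<>();
  private long totalUnigrams = 0;
  private final double lambda = 0.4; // interpolation weight, hand-tuned here

  void add(long count, String... gram) {
    counts.merge(Arrays.asList(gram), count, Long::sum);
    if (gram.length == 1) totalUnigrams += count;
  }

  // p(w1..wn) = lambda * count(w1..wn) / count(w1..wn-1)
  //           + (1 - lambda) * p(w2..wn)
  double prob(List<String> gram) {
    if (gram.size() == 1) {
      return (double) counts.getOrDefault(gram, 0L) / totalUnigrams;
    }
    List<String> context = gram.subList(0, gram.size() - 1);
    List<String> suffix = gram.subList(1, gram.size());
    long contextCount = counts.getOrDefault(context, 0L);
    double ml = contextCount == 0
        ? 0.0
        : (double) counts.getOrDefault(gram, 0L) / contextCount;
    return lambda * ml + (1 - lambda) * prob(suffix);
  }

  public static void main(String[] args) {
    NgramBackoff model = new NgramBackoff();
    model.add(50, "theory");
    model.add(30, "of");
    model.add(20, "relativity");
    model.add(10, "theory", "of");
    model.add(8, "of", "relativity");
    model.add(7, "theory", "of", "relativity");
    System.out.println(model.prob(Arrays.asList("theory", "of", "relativity")));
    // The unseen trigram below still gets probability mass via its
    // observed bigram and unigram parts instead of scoring zero.
    System.out.println(model.prob(Arrays.asList("history", "of", "relativity")));
  }
}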