It's also reasonable to hand-judge 20 docs from each of the four cells of the confusion matrix. That will give you a rough idea of what the error processes are.
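A minimal sketch of one way to draw that sample. It assumes a hypothetical tab-separated file of docId, actual label, predicted label (one row per classified document) that you would have to dump yourself; as far as I know none of the Mahout drivers below emit such a file.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class ConfusionCellSampler {
  // Usage: java ConfusionCellSampler classifications.tsv
  public static void main(String[] args) throws IOException {
    // Bucket document ids by confusion-matrix cell (actual -> predicted).
    Map<String, List<String>> cells = new HashMap<>();
    for (String line : Files.readAllLines(Paths.get(args[0]))) {
      String[] f = line.split("\t"); // docId, actual label, predicted label
      cells.computeIfAbsent(f[1] + " -> " + f[2], k -> new ArrayList<>()).add(f[0]);
    }
    // Draw a fixed-seed random sample of up to 20 docs from each cell,
    // so the sample is reproducible across runs.
    Random rnd = new Random(42);
    for (Map.Entry<String, List<String>> cell : cells.entrySet()) {
      List<String> docs = cell.getValue();
      Collections.shuffle(docs, rnd);
      System.out.println(cell.getKey() + ": "
          + docs.subList(0, Math.min(20, docs.size())));
    }
  }
}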
On Wed, Jul 22, 2009 at 1:13 PM, Miles Osborne <[email protected]> wrote:

> it is probably good to benchmark against standard datasets. for text
> classification this tends to be the Reuters set:
>
> http://www.daviddlewis.com/resources/testcollections/
>
> this way you know if you are doing a good job
>
> Miles
>
> 2009/7/22 Grant Ingersoll <[email protected]>
>
> > The model size is much smaller with unigrams. :-)
> >
> > I'm not quite sure what constitutes good just yet, but I can report
> > the following using the commands I reported earlier, w/ the exception
> > that I am using unigrams:
> >
> > I have two categories: History and Science
> >
> > 0. Splitter:
> > org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
> >   --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml
> >   --outputDir /PATH/wikipedia/chunks -c 64
> >
> > Then prep:
> > org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
> >   --input PATH/wikipedia/test-chunks/
> >   --output PATH/wikipedia/subjects/test
> >   --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
> > (also do this for the training set)
> >
> > 1. Train set:
> > ls ../chunks
> > chunk-0001.xml  chunk-0005.xml  chunk-0009.xml  chunk-0013.xml
> > chunk-0017.xml  chunk-0021.xml  chunk-0025.xml  chunk-0029.xml
> > chunk-0033.xml  chunk-0037.xml
> > chunk-0002.xml  chunk-0006.xml  chunk-0010.xml  chunk-0014.xml
> > chunk-0018.xml  chunk-0022.xml  chunk-0026.xml  chunk-0030.xml
> > chunk-0034.xml  chunk-0038.xml
> > chunk-0003.xml  chunk-0007.xml  chunk-0011.xml  chunk-0015.xml
> > chunk-0019.xml  chunk-0023.xml  chunk-0027.xml  chunk-0031.xml
> > chunk-0035.xml  chunk-0039.xml
> > chunk-0004.xml  chunk-0008.xml  chunk-0012.xml  chunk-0016.xml
> > chunk-0020.xml  chunk-0024.xml  chunk-0028.xml  chunk-0032.xml
> > chunk-0036.xml
> >
> > 2. Test set:
> > ls
> > chunk-0101.xml  chunk-0103.xml  chunk-0105.xml  chunk-0108.xml
> > chunk-0130.xml  chunk-0132.xml  chunk-0134.xml  chunk-0137.xml
> > chunk-0102.xml  chunk-0104.xml  chunk-0107.xml  chunk-0109.xml
> > chunk-0131.xml  chunk-0133.xml  chunk-0135.xml  chunk-0139.xml
> >
> > 3. Run the Trainer on the train set:
> >   --input PATH/wikipedia/subjects/out
> >   --output PATH/wikipedia/subjects/model
> >   --gramSize 1 --classifierType bayes
> >
> > 4. Run the TestClassifier:
> >   --model PATH/wikipedia/subjects/model
> >   --testDir PATH/wikipedia/subjects/test
> >   --gramSize 1 --classifierType bayes
> >
> > Output is:
> >
> > <snip>
> > 09/07/22 15:55:09 INFO bayes.TestClassifier:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances   :  4143    74.0615%
> > Incorrectly Classified Instances :  1451    25.9385%
> > Total Classified Instances       :  5594
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a     b     <--Classified as
> > 3910  186   |  4096  a = history
> > 1265  233   |  1498  b = science
> > Default Category: unknown: 2
> > </snip>
> >
> > At least it's better than 50%, which is presumably a good thing ;-)
> > I have no clue what the state of the art is these days, but it
> > doesn't seem _horrendous_ either.
> >
> > I'd love to see someone validate what I have done. Let me know if you
> > need more details. I'd also like to know how I can improve it.
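A side note on reading that output: the headline 74% masks a strong skew toward the larger class. The sketch below hard-codes the counts from the confusion matrix above (rows are the actual labels, columns the predicted ones) and derives per-class precision and recall; it shows recall of about 0.95 on history but only about 0.16 on science, so most science articles are being misfiled as history.

public class ConfusionStats {
  public static void main(String[] args) {
    // Counts copied from the TestClassifier output above.
    String[] labels = {"history", "science"};
    long[][] m = {
        {3910,  186},   // actual history: 3910 -> history, 186 -> science
        {1265,  233},   // actual science: 1265 -> history, 233 -> science
    };
    long total = 0, correct = 0;
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < m.length; j++) {
        total += m[i][j];
        if (i == j) correct += m[i][j];
      }
    }
    System.out.printf("accuracy = %.4f%n", (double) correct / total); // 0.7406
    for (int c = 0; c < labels.length; c++) {
      long tp = m[c][c];
      long actual = 0, predicted = 0;
      for (int k = 0; k < labels.length; k++) {
        actual += m[c][k];     // row sum: docs whose true label is c
        predicted += m[k][c];  // column sum: docs predicted as c
      }
      System.out.printf("%-7s precision = %.3f  recall = %.3f%n",
          labels[c], (double) tp / predicted, (double) tp / actual);
    }
  }
}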
> > On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
> >
> > > Indeed. I hadn't snapped to the fact that you were using trigrams.
> > >
> > > 30 million features is quite plausible for that. To use long n-grams
> > > effectively as features in document classification, you really need
> > > the following:
> > >
> > > a) good statistical methods for resolving what is useful and what is
> > > not. Everybody here knows that my preference for a first hack is
> > > sparsification with log-likelihood ratios.
> > >
> > > b) some kind of smoothing using smaller n-grams
> > >
> > > c) some kind of smoothing over variants of n-grams
> > >
> > > AFAIK, Mahout doesn't have many (or any) of these in place. You are
> > > likely to do better with unigrams as a result.
> > >
> > > On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll
> > > <[email protected]> wrote:
> > >
> > > > I suspect the explosion in the number of features, Ted, is due to
> > > > the use of n-grams producing a lot of unique terms. I can try w/
> > > > gramSize = 1; that will likely reduce the feature set quite a bit.
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

--
Ted Dunning, CTO
DeepDyve
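For anyone who wants to try point (a): the test Ted refers to is the G^2 log-likelihood ratio from his 1993 paper, "Accurate Methods for the Statistics of Surprise and Coincidence". Below is a minimal, self-contained sketch (not Mahout's implementation; the counts in main() are invented for illustration). The idea is to score each n-gram against each class via a 2x2 contingency table and keep only n-grams whose score clears some threshold, which sparsifies the feature set.

public class LlrSparsifier {
  // x * ln(x), with the convention 0 * ln(0) = 0.
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized Shannon entropy of a list of counts:
  // H = N * ln(N) - sum_i x_i * ln(x_i), where N = sum_i x_i.
  private static double entropy(long... counts) {
    long sum = 0;
    double sumXLogX = 0.0;
    for (long x : counts) {
      sum += x;
      sumXLogX += xLogX(x);
    }
    return xLogX(sum) - sumXLogX;
  }

  // G^2 = 2 * (H(row sums) + H(column sums) - H(cells)) for the 2x2 table:
  //   k11 = n-gram in class,        k12 = n-gram outside class,
  //   k21 = other tokens in class,  k22 = other tokens outside class.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negative values caused by floating-point rounding.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
  }

  public static void main(String[] args) {
    // Made-up counts: an n-gram seen 110 times in 10,000 "science" tokens
    // but only 30 times in 40,000 "history" tokens. A large score (well
    // above the ~10.8 cutoff for p < 0.001 at one degree of freedom)
    // marks the n-gram as worth keeping as a feature.
    System.out.printf("LLR = %.2f%n", logLikelihoodRatio(110, 30, 9890, 39970));
  }
}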

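Point (b) is deliberately open-ended; one common concrete reading is back-off/interpolation, where a long n-gram's probability estimate is mixed with the estimate for its shorter suffix, so rare or unseen long n-grams still get sensible mass. A minimal sketch under that assumption; the counts, vocabulary, and lambda below are all invented for illustration and this is not anything Mahout does.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NgramBackoff {
  // Illustrative per-class n-gram counts; a real model would be built
  // from the training corpus for each category.
  private final Map<List<String>, Long> counts = new HashMap<>();
  private long totalUnigrams = 0;
  private final double lambda = 0.4; // interpolation weight, hand-tuned here

  void add(long count, String... gram) {
    counts.merge(Arrays.asList(gram), count, Long::sum);
    if (gram.length == 1) totalUnigrams += count;
  }

  // p(w1..wn) = lambda * count(w1..wn) / count(w1..wn-1)
  //           + (1 - lambda) * p(w2..wn)
  double prob(List<String> gram) {
    if (gram.size() == 1) {
      return (double) counts.getOrDefault(gram, 0L) / totalUnigrams;
    }
    List<String> context = gram.subList(0, gram.size() - 1);
    List<String> suffix = gram.subList(1, gram.size());
    long contextCount = counts.getOrDefault(context, 0L);
    double ml = contextCount == 0
        ? 0.0
        : (double) counts.getOrDefault(gram, 0L) / contextCount;
    return lambda * ml + (1 - lambda) * prob(suffix);
  }

  public static void main(String[] args) {
    NgramBackoff model = new NgramBackoff();
    model.add(50, "theory");
    model.add(30, "of");
    model.add(20, "relativity");
    model.add(10, "theory", "of");
    model.add(8, "of", "relativity");
    model.add(7, "theory", "of", "relativity");
    System.out.println(model.prob(Arrays.asList("theory", "of", "relativity")));
    // The unseen trigram below still gets probability mass via its
    // observed bigram and unigram parts instead of scoring zero.
    System.out.println(model.prob(Arrays.asList("history", "of", "relativity")));
  }
}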