Yeah, the result looks like the paper's. But unlike the paper, we have bigram
support, so the result might look a wee bit better with it, and even more so
with pruning.

Robin

On Thu, Jul 22, 2010 at 9:31 PM, Ted Dunning <[email protected]> wrote:

> This looks much more in line with the figures in Rennie's paper (86% best
> score, if I remember) and the numbers that I get for the SGD system running
> on the bydate version of the 20 newsgroups (about 83-85%).  The bydate
> version of the corpus has test documents that were segregated by time which
> mirrors normal operations a little bit better than random selection.  It
> also has a few duplicate documents removed.
>
> On Thu, Jul 22, 2010 at 8:32 PM, Drew Farris (JIRA) <[email protected]> wrote:
>
> >     [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Drew Farris updated MAHOUT-442:
> > -------------------------------
> >
> >    Attachment: MAHOUT-442-20news-comparison.txt
> >
> > Held back 100 documents from each newsgroup -- the results look a bit
> > better.
> >
> > Untrimmed:
> >
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :       1698          84.9%
> > Incorrectly Classified Instances        :        302          15.1%
> > Total Classified Instances              :       2000
> >
> > =======================================================
> >
> >
> > Trimmed:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :       1705         85.25%
> > Incorrectly Classified Instances        :        295         14.75%
> > Total Classified Instances              :       2000
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> >
> > > Simple feature reduction options for Bayes classifiers
> > > ------------------------------------------------------
> > >
> > >                 Key: MAHOUT-442
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
> > >             Project: Mahout
> > >          Issue Type: Improvement
> > >          Components: Classification
> > >    Affects Versions: 0.3
> > >            Reporter: Drew Farris
> > >            Assignee: Drew Farris
> > >         Attachments: MAHOUT-442-20news-comparison.txt,
> > >                      MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch
> > >
> > >
> > > Adding options to the Bayes TrainClassifier driver to filter features
> > > using minimum document frequency (df) or term frequency (tf). Features
> > > that appear in only a handful of documents, or fewer than X times within
> > > the entire input set, will be removed from the training feature set
> > > entirely. This will allow the Bayes classifiers to scale to larger
> > > corpora.
> > > More background:
> > > When running the wikipedia example, I discovered that the number of
> > > features produced with -ng 1 was pretty outstanding: 9,500,000 using the
> > > default settings after running the following commands:
> > > {code}
> > > ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d wikipedia/enwiki-20100622-pages-articles.xml.bz2 -o wikipedia/chunks -c 64
> > > ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i wikipedia/chunks -o wikipedia/bayes-input -c examples/src/test/resources/country.txt
> > > ./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier -i wikipedia/bayes-input -o wikipedia/bayes-model -type cbayes -ng 1 -source hdfs
> > > {code}
> > > This of course makes testing the classifier tricky on machines of
> > > modest means, because TestClassifier attempts to load all features into
> > > memory on the machines the mapper is running on.
> > > It appears that Grant ran into a similar issue last year:
> > > http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c
> > > This patch will add --minDf and --minSupport options to TrainClassifier,
> > > and also --skipCleanup to prevent the deletion of the output of the
> > > BayesFeatureDriver, which can be useful for inspecting the resulting
> > > feature set in order to tune rules for feature production.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
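
For anyone following the thread, here is a rough sketch of the kind of
pruning the issue describes: drop features that occur in too few documents
or too few times overall. It is not Mahout's implementation. The function
name prune_features and its min_df / min_support parameters are made up for
illustration and only loosely mirror the --minDf and --minSupport options
mentioned above.

```python
from collections import Counter


def prune_features(docs, min_df=1, min_support=1):
    """Return the set of features that pass both thresholds.

    docs: iterable of token lists, one list per document.
    min_df: minimum number of documents a feature must appear in.
    min_support: minimum total occurrences across the whole input.
    (Hypothetical helper; names are assumptions, not Mahout's API.)
    """
    df = Counter()       # number of documents containing each feature
    support = Counter()  # total occurrences of each feature
    for tokens in docs:
        support.update(tokens)
        df.update(set(tokens))  # count each feature once per document
    return {f for f in support
            if df[f] >= min_df and support[f] >= min_support}


docs = [
    ["hockey", "puck", "ice", "ice"],
    ["hockey", "goal", "ice"],
    ["space", "orbit", "nasa"],
]
print(sorted(prune_features(docs, min_df=2, min_support=3)))  # ['ice']
```

With the default thresholds of 1 nothing is removed, so the sketch can be
used to compare the trimmed and untrimmed feature counts on a small sample
before running a full training job.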
