Yeah, the results look like the paper's. But unlike the paper, we have bigram support, so the results might look a wee bit better with it, and even more so with pruning.
Robin

On Thu, Jul 22, 2010 at 9:31 PM, Ted Dunning <[email protected]> wrote:

> This looks much more in line with the figures in Rennie's paper (86% best
> score, if I remember) and the numbers that I get for the SGD system running
> on the bytime version of the 20 newsgroups (about 83-85%). The bytime
> version of the corpus has test documents that were segregated by time which
> mirrors normal operations a little bit better than random selection. It
> also has a few duplicate documents removed.
>
> On Thu, Jul 22, 2010 at 8:32 PM, Drew Farris (JIRA) <[email protected]>
> wrote:
>
> >
> > [
> > https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> > ]
> >
> > Drew Farris updated MAHOUT-442:
> > -------------------------------
> >
> >     Attachment: MAHOUT-442-20news-comparison.txt
> >
> > Held back 100 documents from each newsgroup -- the results look a bit
> > better.
> >
> > Untrimmed:
> >
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :       1698       84.9%
> > Incorrectly Classified Instances        :        302       15.1%
> > Total Classified Instances              :       2000
> >
> > =======================================================
> >
> >
> > Trimmed:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances          :       1705      85.25%
> > Incorrectly Classified Instances        :        295      14.75%
> > Total Classified Instances              :       2000
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> >
> > > Simple feature reduction options for Bayes classifiers
> > > ------------------------------------------------------
> > >
> > >                 Key: MAHOUT-442
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
> > >             Project: Mahout
> > >          Issue Type: Improvement
> > >          Components: Classification
> > >    Affects Versions: 0.3
> > >            Reporter: Drew Farris
> > >            Assignee: Drew Farris
> > >         Attachments: MAHOUT-442-20news-comparison.txt, MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch
> > >
> > >
> > > Adding options to the Bayes TrainClassifier driver to filter features
> > > using minimum df or tf. Features that only appear in a handful of
> > > documents or less than X times within the entire input set will be
> > > removed from the training feature set entirely. This will allow the
> > > Bayes classifiers to scale to larger corpora.
> > > More background:
> > > When running the wikipedia example, I discovered that the number of
> > > features produced with -ng 1 was pretty outstanding: 9,500,000 using
> > > the default settings after running the following commands:
> > > {code}
> > > ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d wikipedia/enwiki-20100622-pages-articles.xml.bz2 -owikipedia/chunks -c 64
> > > ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i wikipedia/chunks -o wikipedia/bayes-input -c examples/src/test/resources/country.txt
> > > ./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier -i wikipedia/bayes-input -o wikipedia/bayes-model -type cbayes -ng 1 -source hdfs
> > > {code}
> > > This of course makes testing the classifier tricky on machines of
> > > modest means because TestClassifier attempts to load all features into
> > > memory on the machines the mapper is running on.
> > > It appears that Grant ran into a similar issue last year:
> > > http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c
> > > This patch will add --minDf and --minSupport options to
> > > TrainClassifier. Also --skipCleanup to prevent the deletion of the
> > > output of the BayesFeatureDriver, which can be useful in order to allow
> > > inspection of the resulting feature set in order to tune rules for
> > > feature production.
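
For anyone skimming the thread, the minimum-df / minimum-support filtering described in the issue can be sketched roughly as below. This is only an illustrative Python sketch, not the MAHOUT-442 patch: the function name `prune_features` and its parameters are hypothetical, and it assumes df means the number of documents containing a feature and support means the feature's total count across the whole input.

```python
from collections import Counter


def prune_features(docs, min_df=1, min_support=1):
    """Drop features that are too rare to be worth keeping.

    docs        -- list of token lists, one per document
    min_df      -- minimum number of documents a feature must appear in
    min_support -- minimum total occurrences across all documents
    (Names and semantics are assumptions for illustration only.)
    """
    df = Counter()       # document frequency per feature
    support = Counter()  # total occurrences per feature
    for tokens in docs:
        support.update(tokens)
        df.update(set(tokens))  # count each feature once per document

    # Keep a feature only if it passes both thresholds.
    keep = {f for f in support
            if df[f] >= min_df and support[f] >= min_support}
    pruned = [[t for t in tokens if t in keep] for tokens in docs]
    return pruned, keep


docs = [
    ["hadoop", "mahout", "bayes", "bayes"],
    ["mahout", "bayes", "wikipedia"],
    ["mahout", "typo0"],
]
pruned, vocab = prune_features(docs, min_df=2, min_support=3)
print(sorted(vocab))
```

With these toy documents only "mahout" and "bayes" pass both thresholds; the one-off tokens are dropped, which is the effect that would shrink the in-memory feature set at test time.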
