Does this look good to commit as-is in that case?
On Wed, Jul 28, 2010 at 12:20 PM, Robin Anil <[email protected]> wrote:
> That's too small a text to apply pruning. Should run it without pruning. It's
> good that it croaked when changing the code. It's a sanity check to see if
> things are running alright :)
>
>
> On Tue, Jul 27, 2010 at 9:42 PM, Drew Farris (JIRA) <[email protected]> wrote:
>
>>
>> [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>
>> Drew Farris updated MAHOUT-442:
>> -------------------------------
>>
>> Attachment: MAHOUT-442.patch
>>
>> Latest patch cleans up a couple of issues. Not too sure what to do about the
>> BayesClassifierSelfTest: when run with minDf and minSupport set to 2 it
>> produces some pretty nasty results, which is not necessarily a surprise,
>> but lessens the utility of changing the test to apply these parameters in
>> the first place.
>>
>> > Simple feature reduction options for Bayes classifiers
>> > ------------------------------------------------------
>> >
>> > Key: MAHOUT-442
>> > URL: https://issues.apache.org/jira/browse/MAHOUT-442
>> > Project: Mahout
>> > Issue Type: Improvement
>> > Components: Classification
>> > Affects Versions: 0.3
>> > Reporter: Drew Farris
>> > Assignee: Drew Farris
>> > Attachments: MAHOUT-442-20news-comparison.txt, MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch, MAHOUT-442.patch
>> >
>> >
>> > Adding options to the Bayes TrainClassifier driver to filter features
>> > using minimum df or tf. Features that only appear in a handful of
>> > documents or fewer than X times within the entire input set will be
>> > removed from the training feature set entirely. This will allow the
>> > Bayes classifiers to scale to larger corpora.
>> > More background:
>> > When running the wikipedia example, I discovered that the number of
>> > features produced with -ng 1 was pretty outstanding: 9,500,000 using the
>> > default settings after running the following commands:
>> > {code}
>> > ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d wikipedia/enwiki-20100622-pages-articles.xml.bz2 -owikipedia/chunks -c 64
>> > ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i wikipedia/chunks -o wikipedia/bayes-input -c examples/src/test/resources/country.txt
>> > ./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier -i wikipedia/bayes-input -o wikipedia/bayes-model -type cbayes -ng 1 -source hdfs
>> > {code}
>> > This of course makes testing the classifier tricky on machines of modest
>> > means, because TestClassifier attempts to load all features into memory on
>> > the machines the mapper is running on.
>> > It appears that Grant ran into a similar issue last year:
>> > http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c
>> > This patch will add --minDf and --minSupport options to TrainClassifier.
>> > Also --skipCleanup to prevent the deletion of the output of the
>> > BayesFeatureDriver, which can be useful to allow inspection of the
>> > resulting feature set in order to tune rules for feature production.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>
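
For anyone following the thread, here is a rough sketch of the kind of filtering the issue describes: drop features that appear in too few documents (minDf) or too few times across the whole input (minSupport). This is not the code from the patch. The class name, the method name and the in-memory representation are made up for illustration, and treating both thresholds as inclusive is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Mahout or of MAHOUT-442.patch.
public class FeaturePruningSketch {

    /**
     * Returns the features that occur in at least minDf documents and at
     * least minSupport times in total. Both bounds are assumed inclusive.
     */
    public static Set<String> keptFeatures(List<List<String>> docs, int minDf, int minSupport) {
        Map<String, Integer> docFreq = new HashMap<>();   // number of documents containing the feature
        Map<String, Integer> totalFreq = new HashMap<>(); // occurrences across the whole input
        for (List<String> doc : docs) {
            for (String token : doc) {
                totalFreq.merge(token, 1, Integer::sum);
            }
            // Count each feature once per document for the document frequency.
            for (String token : new HashSet<>(doc)) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>(); // sorted for stable output
        for (Map.Entry<String, Integer> e : totalFreq.entrySet()) {
            if (e.getValue() >= minSupport && docFreq.get(e.getKey()) >= minDf) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("bayes", "classifier", "bayes"),
            Arrays.asList("bayes", "feature"),
            Arrays.asList("feature", "rare"));
        System.out.println(keptFeatures(docs, 2, 2)); // prints [bayes, feature]
    }
}
```

With only three tiny documents, thresholds of 2 already remove half of the vocabulary, which is in line with Robin's point that a very small test corpus is a poor fit for pruning.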
