Does this look good to commit as-is in that case?

On Wed, Jul 28, 2010 at 12:20 PM, Robin Anil <[email protected]> wrote:
> That's too small a text to apply pruning to. You should run it without pruning. It's
> good that it croaked when the code changed. It's a sanity check to see if
> things are running all right :)
>
>
> On Tue, Jul 27, 2010 at 9:42 PM, Drew Farris (JIRA) <[email protected]> wrote:
>
>>
>>     [
>> https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>
>> Drew Farris updated MAHOUT-442:
>> -------------------------------
>>
>>     Attachment: MAHOUT-442.patch
>>
>> Latest patch cleans up a couple of issues. Not too sure what to do about the
>> BayesClassifierSelfTest: when run with minDf and minSupport set to 2 it
>> produces some pretty nasty results, which is not necessarily a surprise,
>> but lessens the utility of changing the test to apply these parameters in
>> the first place.
>>
>> > Simple feature reduction options for Bayes classifiers
>> > ------------------------------------------------------
>> >
>> >                 Key: MAHOUT-442
>> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>> >             Project: Mahout
>> >          Issue Type: Improvement
>> >          Components: Classification
>> >    Affects Versions: 0.3
>> >            Reporter: Drew Farris
>> >            Assignee: Drew Farris
>> >         Attachments: MAHOUT-442-20news-comparison.txt,
>> MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch, MAHOUT-442.patch
>> >
>> >
>> > Adding options to the Bayes TrainClassifier driver to filter features
>> using minimum df or tf. Features that appear in only a handful of documents
>> or fewer than X times within the entire input set will be removed from the
>> training feature set entirely. This will allow the Bayes classifiers to
>> scale to larger corpora.
>> > More background:
>> > When running the wikipedia example, I discovered that the number of
>> features produced with -ng 1 was pretty outstanding: 9,500,000 using the
>> default settings after running the following commands:
>> > {code}
>> > ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d
>> wikipedia/enwiki-20100622-pages-articles.xml.bz2 -o wikipedia/chunks -c 64
>> > ./bin/mahout
>> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
>> wikipedia/chunks -o wikipedia/bayes-input -c
>> examples/src/test/resources/country.txt
>> > ./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier -i
>> wikipedia/bayes-input -o wikipedia/bayes-model -type cbayes -ng 1  -source
>> hdfs
>> > {code}
>> > This of course makes testing the classifier tricky on machines of modest
>> means because TestClassifier attempts to load all features into memory on
>> the machines the mapper is running on.
>> > It appears that Grant ran into a similar issue last year:
>> >
>> http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c
>> > This patch will add --minDf and --minSupport options to TrainClassifier.
>> Also --skipCleanup to prevent the deletion of the output of the
>> BayesFeatureDriver, which can be useful for inspecting the
>> resulting feature set in order to tune rules for feature production.
>>
>
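
For anyone following along, here is a rough sketch of the kind of filtering the quoted issue describes: drop features whose document frequency or total count falls below a threshold. This is not the code from the MAHOUT-442 patch. It is plain Python for illustration only, and the function name prune_features, the parameter names, and the "keep when count >= threshold" boundary are all my assumptions.

```python
from collections import Counter


def prune_features(docs, min_df=1, min_support=1):
    """Illustrative only (hypothetical helper, not Mahout code).

    docs is a list of token lists. A token is kept when it appears in at
    least min_df documents AND at least min_support times overall.
    """
    df = Counter()  # number of documents containing each token
    tf = Counter()  # total occurrences of each token in the whole input
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))

    keep = {t for t in tf if df[t] >= min_df and tf[t] >= min_support}
    pruned = [[t for t in tokens if t in keep] for tokens in docs]
    return pruned, keep


# Tiny made-up corpus, purely for demonstration.
docs = [
    ["java", "hadoop", "mahout"],
    ["java", "java", "lucene"],
    ["hadoop", "mahout", "mahout"],
]
pruned, vocab = prune_features(docs, min_df=2, min_support=2)
```

With both thresholds at 2, "lucene" is dropped because it occurs once in one document. Raising min_df to 3 on this toy corpus removes every feature, which is in line with Robin's point that a very small text is a poor candidate for pruning.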
