Re: [jira] Updated: (MAHOUT-442) Simple feature reduction options for Bayes classifiers

Robin Anil Thu, 22 Jul 2010 12:08:56 -0700

Yes, makes perfect sense.
Try it with a test set != train set. The trimmed model's performance could
improve there, since it should overfit less. Otherwise looks good to go.
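
One way to get a test set that differs from the training set is a simple random hold-out split. The sketch below is only illustrative: the function name, the 20% test fraction, and the fixed seed are assumptions, not anything taken from the Mahout drivers or from the patch.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items deterministically and split off a held-out test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical document ids standing in for the 20-news posts.
docs = [f"doc-{i}" for i in range(10)]
train, test = train_test_split(docs)
print(len(train), len(test))
```

Training on `train` and evaluating only on `test` keeps the reported accuracy from rewarding features the model has simply memorized.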



On Jul 22, 2010 12:00 PM, "Drew Farris (JIRA)" <[email protected]> wrote:
>
> [ https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Drew Farris updated MAHOUT-442:
> -------------------------------
>
> Attachment: MAHOUT-442-20news-comparison.txt
>
> Here are the confusion matrices for an untrimmed run against 20-news and a
> run against 20-news with --minDf=2 and --minSupport=2
>
> The trimmed version did not do as well as the untrimmed in this case:
>
> Untrimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances : 18305 97.2222%
> Incorrectly Classified Instances : 523 2.7778%
> Total Classified Instances : 18828
>
> Trimmed:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances : 18085 96.0537%
> Incorrectly Classified Instances : 743 3.9463%
> Total Classified Instances : 18828
>
>
>
>> Simple feature reduction options for Bayes classifiers
>> ------------------------------------------------------
>>
>> Key: MAHOUT-442
>> URL: https://issues.apache.org/jira/browse/MAHOUT-442
>> Project: Mahout
>> Issue Type: Improvement
>> Components: Classification
>> Affects Versions: 0.3
>> Reporter: Drew Farris
>> Assignee: Drew Farris
>> Attachments: MAHOUT-442-20news-comparison.txt, MAHOUT-442.patch
>>
>>
>> Adding options to the Bayes TrainClassifier driver to filter features
>> using minimum df or tf. Features that appear in only a handful of documents,
>> or fewer than X times within the entire input set, will be removed from the
>> training feature set entirely. This will allow the Bayes classifiers to
>> scale to larger corpora.
>> More background:
>> When running the wikipedia example, I discovered that the number of
>> features produced with -ng 1 was pretty outstanding: 9,500,000 using the
>> default settings after running the following commands:
>> {code}
>> ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d wikipedia/enwiki-20100622-pages-articles.xml.bz2 -o wikipedia/chunks -c 64
>> ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i wikipedia/chunks -o wikipedia/bayes-input -c examples/src/test/resources/country.txt
>> ./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier -i wikipedia/bayes-input -o wikipedia/bayes-model -type cbayes -ng 1 -source hdfs
>> {code}
>> This of course makes testing the classifier tricky on machines of modest
>> means, because TestClassifier attempts to load all features into memory on
>> the machines the mappers are running on.
>> It appears that Grant ran into a similar issue last year:
>> http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c
>> This patch will add --minDf and --minSupport options to TrainClassifier.
>> It also adds --skipCleanup to prevent the deletion of the output of the
>> BayesFeatureDriver, which can be useful for inspecting the resulting
>> feature set in order to tune rules for feature production.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
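
For readers following the thread, the kind of filtering the issue describes (drop features that appear in too few documents, or too few times across the whole input) can be sketched in a few lines. This is a minimal illustration and not the code in MAHOUT-442.patch: the function name and the in-memory token lists are assumptions, and it assumes --minDf means minimum document frequency and --minSupport means minimum total occurrences over the corpus, which is how the issue description reads.

```python
from collections import Counter

def filter_features(docs, min_df=1, min_support=1):
    """Return the set of features that meet both thresholds.

    docs is a list of token lists, one per document (hypothetical input format).
    """
    df = Counter()       # number of documents containing each feature
    support = Counter()  # total occurrences of each feature across all documents
    for tokens in docs:
        support.update(tokens)
        df.update(set(tokens))
    return {f for f in support if df[f] >= min_df and support[f] >= min_support}

docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["banana", "durian", "durian"],
]
kept = filter_features(docs, min_df=2, min_support=2)
print(sorted(kept))  # ['banana']
```

Here "apple" and "durian" each occur twice but in only one document, so the document-frequency threshold removes them, while "cherry" fails both thresholds. On a real corpus the long tail of such rare features is what makes up most of the feature count.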
