[ 
https://issues.apache.org/jira/browse/MAHOUT-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-442:
-------------------------------

    Attachment: MAHOUT-442.patch


Core changes:
   * BayesFeatureMapper now collects tf as FEATURE_TF. df is already collected 
as FEATURE_COUNT. 
   * Introduced BayesFeatureCombiner to do simple combination since the 
operations performed by the reducer are more complex.
   * FeaturePartitioner ensures that all tuples for a given feature 
(term/ngram) are directed to the same reducer.
   * FeatureLabelComparator ensures that FEATURE_TF and FEATURE_COUNT arrive at 
the reducer prior to any other tuples, and that all tuples for a given feature 
are processed consecutively. 
   * BayesFeatureReducer now does filtering on all tuples based on TF and DF 
configured using --minSupport and --minDf, passed in as a part of the 
BayesParameters object.
   * Deprecated the BayesParameters(ngramSize) constructor in favor of the 
setNgramSize, setMinDf, and setMinSupport methods.
   * Included unit test for BayesFeature mapreduce process.
   * All other unit tests pass.
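
The partition, sort, and filter steps listed above can be sketched in plain 
Java, without Hadoop. This is an illustration only and not the code in the 
patch: the class FeatureFilterSketch, the Tuple type, and the WEIGHT kind are 
hypothetical, and it assumes that the ordering is per feature (a feature's 
FEATURE_TF and FEATURE_COUNT tuples first, then its other tuples) and that a 
feature is kept when both sums meet or exceed the thresholds.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: a plain-Java stand-in for the partition/sort/filter flow, not Mahout code. */
class FeatureFilterSketch {

    /** FEATURE_TF and FEATURE_COUNT are named in the patch notes; WEIGHT is a hypothetical "other" tuple. */
    enum Kind { FEATURE_TF, FEATURE_COUNT, WEIGHT }

    static final class Tuple {
        final Kind kind;
        final String feature;
        final String label; // empty for per-feature statistics
        final double value;

        Tuple(Kind kind, String feature, String label, double value) {
            this.kind = kind;
            this.feature = feature;
            this.label = label;
            this.value = value;
        }
    }

    /** Partition on the feature alone, so all tuples for a feature reach the same reducer. */
    static int partition(Tuple t, int numReducers) {
        return (t.feature.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Group by feature; within a feature, enum order puts FEATURE_TF and FEATURE_COUNT first. */
    static final Comparator<Tuple> ORDER =
        Comparator.comparing((Tuple t) -> t.feature).thenComparing((Tuple t) -> t.kind);

    /** Sums tf and df per feature, then keeps the feature's other tuples only if both thresholds are met. */
    static List<Tuple> reduce(List<Tuple> input, double minSupport, double minDf) {
        List<Tuple> sorted = new ArrayList<>(input);
        sorted.sort(ORDER);
        List<Tuple> kept = new ArrayList<>();
        String current = null;
        double tf = 0;
        double df = 0;
        for (Tuple t : sorted) {
            if (!t.feature.equals(current)) {
                // New feature: reset the running statistics.
                current = t.feature;
                tf = 0;
                df = 0;
            }
            if (t.kind == Kind.FEATURE_TF) {
                tf += t.value;
            } else if (t.kind == Kind.FEATURE_COUNT) {
                df += t.value;
            } else if (tf >= minSupport && df >= minDf) {
                // Statistics arrived first, so the decision can be made in one pass.
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Tuple> input = new ArrayList<>();
        input.add(new Tuple(Kind.WEIGHT, "rare", "uk", 1.0));
        input.add(new Tuple(Kind.WEIGHT, "common", "uk", 3.0));
        input.add(new Tuple(Kind.FEATURE_TF, "rare", "", 1.0));
        input.add(new Tuple(Kind.FEATURE_COUNT, "rare", "", 1.0));
        input.add(new Tuple(Kind.FEATURE_TF, "common", "", 5.0));
        input.add(new Tuple(Kind.FEATURE_COUNT, "common", "", 2.0));
        // With minSupport = 2 and minDf = 2 only "common" survives.
        for (Tuple t : reduce(input, 2.0, 2.0)) {
            System.out.println(t.feature + " " + t.label + " " + t.value); // prints: common uk 3.0
        }
    }
}
```

Because the statistics for a feature sort ahead of its other tuples, the 
sketch never has to buffer tuples while waiting for tf or df, which is the 
point of pairing the partitioner with the comparator.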

Other changes:
   * A couple of fixes for cases where the BayesParameters weren't printing 
properly.
   * Plumbing for the new command-line options.

> Simple feature reduction options for Bayes classification 
> ----------------------------------------------------------
>
>                 Key: MAHOUT-442
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-442
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>         Attachments: MAHOUT-442.patch
>
>
> Adding options to the Bayes TrainClassifier driver to filter features using 
> minimum df or tf. Features that appear in only a handful of documents, or 
> fewer than X times within the entire input set, will be removed from the 
> training feature set entirely. This will allow the Bayes classifiers to scale 
> to larger corpora.
> More background: 
> When running the wikipedia example, I discovered that the number of features 
> produced with -ng 1 was pretty outstanding: 9,500,000 using the default 
> settings after running the following commands:
> {code}
> ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
>   -d wikipedia/enwiki-20100622-pages-articles.xml.bz2 \
>   -o wikipedia/chunks -c 64
> ./bin/mahout org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver \
>   -i wikipedia/chunks -o wikipedia/bayes-input \
>   -c examples/src/test/resources/country.txt
> ./bin/mahout org.apache.mahout.classifier.bayes.TrainClassifier \
>   -i wikipedia/bayes-input -o wikipedia/bayes-model \
>   -type cbayes -ng 1 -source hdfs
> {code}
> This of course makes testing the classifier tricky on machines of modest 
> means because TestClassifier attempts to load all features into memory on the 
> machines the mapper is running on.
> It appears that Grant ran into a similar issue last year: 
> http://www.lucidimagination.com/search/document/7fff9bc0b3350370/getting_started_with_classification#ba6838a9c8b9090c
> This patch will add --minDf and --minSupport options to TrainClassifier. It 
> also adds --skipCleanup to prevent the deletion of the output of the 
> BayesFeatureDriver, which is useful for inspecting the resulting feature set 
> in order to tune rules for feature production.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
