Author: apalumbo
Date: Thu Mar 19 19:32:46 2015
New Revision: 1667854
URL: http://svn.apache.org/r1667854
Log:
Fixed labels as vector of vectors. Test commit from local CMS manager
Modified:
mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Modified:
mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
URL:
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext?rev=1667854&r1=1667853&r2=1667854&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
(original)
+++ mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Thu Mar 19 19:32:46 2015
@@ -1,24 +1,24 @@
# Naive Bayes
-### Intro
+## Intro
Mahout currently has two Naive Bayes implementations. The first is standard
Multinomial Naive Bayes. The second is an implementation of Transformed
Weight-normalized Complement Naive Bayes as introduced by Rennie et al.
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to
the former as Bayes and the latter as CBayes.
Where Bayes has long been a standard in text classification, CBayes is an
extension of Bayes that performs particularly well on datasets with skewed
classes and has been shown to be competitive with algorithms of higher
complexity such as Support Vector Machines.
-### Implementations
+## Implementations
Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and
classification can be done via a MapReduce Job or sequentially. Mahout
provides CLI drivers for preprocessing, training and testing. A Spark
implementation is currently in the works
([MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)).
-### Preprocessing and Algorithm
+## Preprocessing and Algorithm
As described in
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive
Bayes is broken down into the following steps (assignments are over all
possible index values):
- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents;
`\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
-- Let `\(\vec{y}=(\vec{y_1},...,\vec{y_n})\)` be their labels.
+- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary;
let `\(\alpha=\sum_i{\alpha_i}\)`.
-- **Preprocessing**: TF-IDF transformation and L2 length normalization of
`\(\vec{d}\)`
+- **Preprocessing** (via `seq2sparse`): TF-IDF transformation and L2 length
normalization of `\(\vec{d}\)`
1. `\(d_{ij} = \sqrt{d_{ij}}\)`
2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k 1}{\sum_k\delta_{ik}+1}}+1\right)\)`
3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)`
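For illustration, the three preprocessing steps above can be sketched on a
dense term-count matrix. This is a simplification: `preprocess` and its
inputs are hypothetical, not Mahout's API, and Mahout itself operates on
sparse vectors in sequence files.

```python
import math

def preprocess(docs):
    """Sketch of the three preprocessing steps above.

    docs[j][i] is the raw count of word i in document j, given here as a
    dense list of lists purely for illustration.
    """
    n_docs = len(docs)
    n_terms = len(docs[0])
    # Step 1: square-root TF transform, d_ij = sqrt(d_ij)
    d = [[math.sqrt(c) for c in row] for row in docs]
    # Step 2: IDF transform, d_ij *= log(n_docs / (df_i + 1)) + 1,
    # where df_i is the number of documents containing word i
    df = [sum(1 for j in range(n_docs) if docs[j][i] > 0)
          for i in range(n_terms)]
    d = [[v * (math.log(n_docs / (df[i] + 1)) + 1) for i, v in enumerate(row)]
         for row in d]
    # Step 3: L2 length normalization of each document vector
    out = []
    for row in d:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out
```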
@@ -35,7 +35,7 @@ As described in [[1]](http://people.csai
As we can see, the main difference between Bayes and CBayes is the weight
calculation step. Where Bayes weighs a term more heavily the more likely it
is to belong to class `\(c\)`, CBayes weighs it more heavily the less likely
it is to belong to any other class.
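A minimal sketch of this weight step, under one reading of
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf): `weights` is
illustrative rather than Mahout's API, and the paper's final
weight-normalization step is omitted for brevity.

```python
import math

def weights(docs, labels, alpha_i=1.0, complement=False):
    """Per-class log-weight estimate with smoothing parameter alpha_i.

    docs[j][i] is the (preprocessed) weight of word i in document j and
    labels[j] is document j's class. complement=False estimates from the
    documents *in* class c (Bayes); complement=True estimates from the
    documents in every *other* class (CBayes).
    """
    n_terms = len(docs[0])
    alpha = alpha_i * n_terms  # alpha = sum_i alpha_i
    w = {}
    for c in set(labels):
        if complement:
            rows = [d for d, y in zip(docs, labels) if y != c]
        else:
            rows = [d for d, y in zip(docs, labels) if y == c]
        totals = [sum(r[i] for r in rows) for i in range(n_terms)]
        denom = sum(totals) + alpha
        w[c] = [math.log((t + alpha_i) / denom) for t in totals]
        if complement:
            # Negate the complement weights so that argmax over classes
            # still selects the most likely label.
            w[c] = [-x for x in w[c]]
    return w
```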
-### Running from the command line
+## Running from the command line
Mahout provides CLI drivers for all of the above steps. Here we give a brief
overview of the Mahout CLI commands used to preprocess the data, train the
model and assign labels to the training set. An [example
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
is given for the full process from data acquisition through classification of
the classic [20 Newsgroups
corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
@@ -72,7 +72,7 @@ Classification and testing on a holdout
-c
-seq
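For orientation, a typical invocation sequence looks like the following
sketch. The input and output paths are hypothetical, and the flags follow
the 20 Newsgroups example script (`-c` selects CBayes, `-seq` runs the test
sequentially):

```shell
# Preprocess: raw term counts -> TF-IDF vectors (paths are hypothetical)
mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

# Train: -c selects the CBayes (complement) variant; drop it for plain Bayes
mahout trainnb -i 20news-vectors/tfidf-vectors -o model -li labelindex -ow -c

# Test/classify: -seq runs sequentially instead of as a MapReduce job
mahout testnb -i 20news-vectors/tfidf-vectors -m model -l labelindex \
  -o 20news-testing -ow -c -seq
```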
-### Command line options
+## Command line options
- **Preprocessing:**
@@ -131,13 +131,13 @@ Classification and testing on a holdout
--endPhase endPhase Last phase to run
-### Examples
+## Examples
Mahout provides an example of Naive Bayes classification:
1. [Classify 20 Newsgroups](twenty-newsgroups.html)
-### References
+## References
[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003).
[Tackling the Poor Assumptions of Naive Bayes Text
Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf).
Proceedings of the Twentieth International Conference on Machine Learning
(ICML-2003).