Author: apalumbo
Date: Thu Mar 19 19:32:46 2015
New Revision: 1667854
URL: http://svn.apache.org/r1667854
Log:
Fixed labels as vector of vectors. Test commit from local CMS manager
Modified:
mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Modified:
mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
URL:
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext?rev=1667854&r1=1667853&r2=1667854&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
(original)
+++ mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Thu Mar 19 19:32:46 2015
@@ -1,24 +1,24 @@
# Naive Bayes
-### Intro
+## Intro
Mahout currently has two Naive Bayes implementations. The first is standard
Multinomial Naive Bayes. The second is an implementation of Transformed
Weight-normalized Complement Naive Bayes as introduced by Rennie et al.
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to
the former as Bayes and the latter as CBayes.
Where Bayes has long been a standard in text classification, CBayes is an
extension of Bayes that performs particularly well on datasets with skewed
classes and has been shown to be competitive with algorithms of higher
complexity such as Support Vector Machines.
-### Implementations
+## Implementations
Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and
classification can be done via a MapReduce Job or sequentially. Mahout
provides CLI drivers for preprocessing, training and testing. A Spark
implementation is currently in the works
([MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)).
-### Preprocessing and Algorithm
+## Preprocessing and Algorithm
As described in
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive
Bayes is broken down into the following steps (assignments are over all
possible index values):
- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents;
`\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
-- Let `\(\vec{y}=(\vec{y_1},...,\vec{y_n})\)` be their labels.
+- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary;
let `\(\alpha=\sum_i{\alpha_i}\)`.
-- **Preprocessing**: TF-IDF transformation and L2 length normalization of
`\(\vec{d}\)`
+- **Preprocessing** (via `seq2sparse`): TF-IDF transformation and L2 length
normalization of `\(\vec{d}\)`
1. `\(d_{ij} = \sqrt{d_{ij}}\)`
2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k 1}{\sum_k\delta_{ik}+1}}+1\right)\)`
3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)`
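For illustration, the three preprocessing steps above can be sketched on a
dense term-count matrix. This is a simplification: `preprocess` and its
inputs are hypothetical, not Mahout's API, and Mahout itself operates on
sparse vectors in sequence files.

```python
import math

def preprocess(docs):
    """Sketch of the three preprocessing steps above.

    docs[j][i] is the raw count of word i in document j, given here as a
    dense list of lists purely for illustration.
    """
    n_docs = len(docs)
    n_terms = len(docs[0])
    # Step 1: square-root TF transform, d_ij = sqrt(d_ij)
    d = [[math.sqrt(c) for c in row] for row in docs]
    # Step 2: IDF transform, d_ij *= log(n_docs / (df_i + 1)) + 1,
    # where df_i is the number of documents containing word i
    df = [sum(1 for j in range(n_docs) if docs[j][i] > 0)
          for i in range(n_terms)]
    d = [[v * (math.log(n_docs / (df[i] + 1)) + 1) for i, v in enumerate(row)]
         for row in d]
    # Step 3: L2 length normalization of each document vector
    out = []
    for row in d:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row])
    return out
```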
@@ -35,7 +35,7 @@ As described in [[1]](http://people.csai
As we can see, the main difference between Bayes and CBayes is the weight
calculation step. Where Bayes weighs a term more heavily the more likely it
is to belong to class `\(c\)`, CBayes weighs it more heavily the less likely
it is to belong to any other class.
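A minimal sketch of this weight step, under one reading of
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf): `weights` is
illustrative rather than Mahout's API, and the paper's final
weight-normalization step is omitted for brevity.

```python
import math

def weights(docs, labels, alpha_i=1.0, complement=False):
    """Per-class log-weight estimate with smoothing parameter alpha_i.

    docs[j][i] is the (preprocessed) weight of word i in document j and
    labels[j] is document j's class. complement=False estimates from the
    documents *in* class c (Bayes); complement=True estimates from the
    documents in every *other* class (CBayes).
    """
    n_terms = len(docs[0])
    alpha = alpha_i * n_terms  # alpha = sum_i alpha_i
    w = {}
    for c in set(labels):
        if complement:
            rows = [d for d, y in zip(docs, labels) if y != c]
        else:
            rows = [d for d, y in zip(docs, labels) if y == c]
        totals = [sum(r[i] for r in rows) for i in range(n_terms)]
        denom = sum(totals) + alpha
        w[c] = [math.log((t + alpha_i) / denom) for t in totals]
        if complement:
            # Negate the complement weights so that argmax over classes
            # still selects the most likely label.
            w[c] = [-x for x in w[c]]
    return w
```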
-### Running from the command line
+## Running from the command line
Mahout provides CLI drivers for all of the above steps. Here we give a brief
overview of the Mahout CLI commands used to preprocess the data, train the
model and assign labels to the training set. An [example
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
is given for the full process from data acquisition through classification of
the classic [20 Newsgroups
corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
@@ -72,7 +72,7 @@ Classification and testing on a holdout
-c
-seq
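For orientation, a typical invocation sequence looks like the following
sketch. The input and output paths are hypothetical, and the flags follow
the 20 Newsgroups example script (`-c` selects CBayes, `-seq` runs the test
sequentially):

```shell
# Preprocess: raw term counts -> TF-IDF vectors (paths are hypothetical)
mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

# Train: -c selects the CBayes (complement) variant; drop it for plain Bayes
mahout trainnb -i 20news-vectors/tfidf-vectors -o model -li labelindex -ow -c

# Test/classify: -seq runs sequentially instead of as a MapReduce job
mahout testnb -i 20news-vectors/tfidf-vectors -m model -l labelindex \
  -o 20news-testing -ow -c -seq
```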
-### Command line options
+## Command line options
- **Preprocessing:**
@@ -131,13 +131,13 @@ Classification and testing on a holdout
--endPhase endPhase Last phase to run
-### Examples
+## Examples
Mahout provides an example of Naive Bayes classification:
1. [Classify 20 Newsgroups](twenty-newsgroups.html)
-### References
+## References
[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003).
[Tackling the Poor Assumptions of Naive Bayes Text
Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf).
Proceedings of the Twentieth International Conference on Machine Learning
(ICML-2003).