Author: buildbot
Date: Thu Mar 19 19:32:51 2015
New Revision: 944372
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/classification/bayesian.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Mar 19 19:32:51 2015
@@ -1 +1 @@
-1665101
+1667854
Modified:
websites/staging/mahout/trunk/content/users/classification/bayesian.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/bayesian.html
(original)
+++ websites/staging/mahout/trunk/content/users/classification/bayesian.html
Thu Mar 19 19:32:51 2015
@@ -246,18 +246,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
<h1 id="naive-bayes">Naive Bayes</h1>
-<h3 id="intro">Intro</h3>
+<h2 id="intro">Intro</h2>
<p>Mahout currently has two Naive Bayes implementations. The first is
standard Multinomial Naive Bayes. The second is an implementation of
Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et
al. <a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a>.
We refer to the former as Bayes and the latter as CBayes.</p>
<p>While Bayes has long been a standard in text classification, CBayes is an
extension of Bayes that performs particularly well on datasets with skewed
classes and has been shown to be competitive with higher-complexity algorithms
such as Support Vector Machines.</p>
-<h3 id="implementations">Implementations</h3>
+<h2 id="implementations">Implementations</h2>
<p>Both Bayes and CBayes are currently trained via MapReduce jobs. Testing and
classification can be done either via a MapReduce job or sequentially. Mahout
provides CLI drivers for preprocessing, training and testing. A Spark
implementation is currently in the works (<a
href="https://issues.apache.org/jira/browse/MAHOUT-1493">MAHOUT-1493</a>).</p>
-<h3 id="preprocessing-and-algorithm">Preprocessing and Algorithm</h3>
+<h2 id="preprocessing-and-algorithm">Preprocessing and Algorithm</h2>
<p>As described in <a
href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a>, Mahout
Naive Bayes is broken down into the following steps (assignments are over all
possible index values): </p>
<ul>
<li>Let <code>\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)</code> be a set of
documents; <code>\(d_{ij}\)</code> is the count of word <code>\(i\)</code> in
document <code>\(j\)</code>.</li>
-<li>Let <code>\(\vec{y}=(\vec{y_1},...,\vec{y_n})\)</code> be their
labels.</li>
+<li>Let <code>\(\vec{y}=(y_1,...,y_n)\)</code> be their labels.</li>
<li>Let <code>\(\alpha_i\)</code> be a smoothing parameter for all words in
the vocabulary; let <code>\(\alpha=\sum_i{\alpha_i}\)</code>. </li>
-<li><strong>Preprocessing</strong>: TF-IDF transformation and L2 length
normalization of <code>\(\vec{d}\)</code><ol>
+<li><strong>Preprocessing</strong> (via <code>seq2sparse</code>): TF-IDF
transformation and L2 length normalization of <code>\(\vec{d}\)</code><ol>
<li><code>\(d_{ij} = \sqrt{d_{ij}}\)</code> </li>
<li><code>\(d_{ij} =
d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)</code> </li>
<li><code>\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)</code> </li>
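The three preprocessing transforms above can be sketched in NumPy. This is an
illustrative rendering of the formulas only, not Mahout's Java/MapReduce
implementation; the matrix <code>D</code> (words as rows, documents as columns)
and the helper name <code>preprocess</code> are hypothetical.

```python
import numpy as np

def preprocess(D):
    """TF-IDF transform and L2 length normalization of a word-by-document
    count matrix D, following the three steps above (illustrative sketch)."""
    D = np.asarray(D, dtype=float)
    n_docs = D.shape[1]
    # 1. Dampen raw term frequencies: d_ij = sqrt(d_ij)
    D = np.sqrt(D)
    # 2. IDF: scale each word's row by log(n / (df_i + 1)) + 1,
    #    where df_i is the number of documents containing word i
    df = np.count_nonzero(D, axis=1)
    D = D * (np.log(n_docs / (df + 1)) + 1)[:, np.newaxis]
    # 3. L2-normalize each document (column) to unit length
    norms = np.sqrt((D ** 2).sum(axis=0))
    return D / norms

# Toy corpus: 3 words x 2 documents
counts = [[2, 0],
          [1, 1],
          [0, 4]]
X = preprocess(counts)
```

After step 3 every document column has unit L2 norm, which is what makes the
subsequent per-term weights comparable across documents of different lengths.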
@@ -281,7 +281,7 @@
</li>
</ul>
<p>As we can see, the main difference between Bayes and CBayes is the weight
calculation step. Where Bayes weighs a term by the likelihood that it belongs
to class <code>\(c\)</code>, CBayes derives a term's weight for class
<code>\(c\)</code> from its occurrence in all classes <em>other than</em>
<code>\(c\)</code>, which makes the weight estimates much less sensitive to
skewed class sizes.</p>
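To make the contrast concrete, here is a minimal NumPy sketch of the two
weighting schemes. It illustrates the idea only, not Mahout's implementation:
the helper names are hypothetical, and TF-IDF preprocessing and the weight
normalization of CBayes are omitted for brevity.

```python
import numpy as np

def bayes_weights(D, labels, alpha_i=1.0):
    """Bayes: weight for term i in class c from its counts *in* class c."""
    D = np.asarray(D, dtype=float)
    classes = sorted(set(labels))
    W = np.empty((len(classes), D.shape[0]))
    for r, c in enumerate(classes):
        in_c = D[:, [j for j, y in enumerate(labels) if y == c]].sum(axis=1)
        # smoothed log-likelihood of term i given class c
        W[r] = np.log((in_c + alpha_i) / (in_c.sum() + alpha_i * D.shape[0]))
    return W

def cbayes_weights(D, labels, alpha_i=1.0):
    """CBayes: weight from term counts in every class *except* c."""
    D = np.asarray(D, dtype=float)
    classes = sorted(set(labels))
    W = np.empty((len(classes), D.shape[0]))
    for r, c in enumerate(classes):
        not_c = D[:, [j for j, y in enumerate(labels) if y != c]].sum(axis=1)
        # negated log-likelihood of the complement, so argmax still applies
        W[r] = -np.log((not_c + alpha_i) / (not_c.sum() + alpha_i * D.shape[0]))
    return W

# Toy data: 3 words x 4 documents, one label per document
D = [[3, 2, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 4, 3]]
labels = [0, 0, 1, 1]
doc = np.array([2.0, 1.0, 0.0])  # term counts of a new document
pred_b = int(np.argmax(bayes_weights(D, labels) @ doc))
pred_c = int(np.argmax(cbayes_weights(D, labels) @ doc))
```

On this toy data both schemes agree; the complement estimates of CBayes pay
off when class sizes are heavily skewed, since every class's weights are
estimated from roughly the same amount of data.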
-<h3 id="running-from-the-command-line">Running from the command line</h3>
+<h2 id="running-from-the-command-line">Running from the command line</h2>
<p>Mahout provides CLI drivers for all of the above steps. Here we give a
simple overview of Mahout CLI commands used to preprocess the data, train the
model and assign labels to the training set. An <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">example
script</a> is given for the full process from data acquisition through
classification of the classic <a
href="https://mahout.apache.org/users/classification/twenty-newsgroups.html">20
Newsgroups corpus</a>. </p>
<ul>
<li>
@@ -327,7 +327,7 @@ Classification and testing on a holdout
</li>
</ul>
-<h3 id="command-line-options">Command line options</h3>
+<h2 id="command-line-options">Command line options</h2>
<ul>
<li><strong>Preprocessing:</strong></li>
</ul>
@@ -393,12 +393,12 @@ Classification and testing on a holdout
</li>
</ul>
-<h3 id="examples">Examples</h3>
+<h2 id="examples">Examples</h2>
<p>Mahout provides an example for Naive Bayes classification:</p>
<ol>
<li><a href="twenty-newsgroups.html">Classify 20 Newsgroups</a></li>
</ol>
-<h3 id="references">References</h3>
+<h2 id="references">References</h2>
<p>[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003).
<a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">Tackling the
Poor Assumptions of Naive Bayes Text Classifiers</a>. Proceedings of the
Twentieth International Conference on Machine Learning (ICML-2003).</p>
</div>
</div>