Author: buildbot
Date: Thu Mar 19 19:32:51 2015
New Revision: 944372
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/classification/bayesian.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Mar 19 19:32:51 2015
@@ -1 +1 @@
-1665101
+1667854
Modified:
websites/staging/mahout/trunk/content/users/classification/bayesian.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/classification/bayesian.html
(original)
+++ websites/staging/mahout/trunk/content/users/classification/bayesian.html
Thu Mar 19 19:32:51 2015
@@ -246,18 +246,18 @@
<div id="content-wrap" class="clearfix">
<div id="main">
<h1 id="naive-bayes">Naive Bayes</h1>
-<h3 id="intro">Intro</h3>
+<h2 id="intro">Intro</h2>
<p>Mahout currently has two Naive Bayes implementations. The first is
standard Multinomial Naive Bayes. The second is an implementation of
Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et
al. <a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a>.
We refer to the former as Bayes and the latter as CBayes.</p>
<p>While Bayes has long been a standard in text classification, CBayes is an
extension of Bayes that performs particularly well on datasets with skewed
classes and has been shown to be competitive with higher-complexity algorithms
such as Support Vector Machines.</p>
-<h3 id="implementations">Implementations</h3>
+<h2 id="implementations">Implementations</h2>
<p>Both Bayes and CBayes are currently trained via MapReduce jobs. Testing and
classification can be done either via a MapReduce job or sequentially. Mahout
provides CLI drivers for preprocessing, training and testing. A Spark
implementation is currently in the works (<a
href="https://issues.apache.org/jira/browse/MAHOUT-1493">MAHOUT-1493</a>).</p>
-<h3 id="preprocessing-and-algorithm">Preprocessing and Algorithm</h3>
+<h2 id="preprocessing-and-algorithm">Preprocessing and Algorithm</h2>
<p>As described in <a
href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a>, Mahout
Naive Bayes is broken down into the following steps (assignments are over all
possible index values): </p>
<ul>
<li>Let <code>\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)</code> be a set of
documents; <code>\(d_{ij}\)</code> is the count of word <code>\(i\)</code> in
document <code>\(j\)</code>.</li>
-<li>Let <code>\(\vec{y}=(\vec{y_1},...,\vec{y_n})\)</code> be their
labels.</li>
+<li>Let <code>\(\vec{y}=(y_1,...,y_n)\)</code> be their labels.</li>
<li>Let <code>\(\alpha_i\)</code> be a smoothing parameter for all words in
the vocabulary; let <code>\(\alpha=\sum_i{\alpha_i}\)</code>. </li>
-<li><strong>Preprocessing</strong>: TF-IDF transformation and L2 length
normalization of <code>\(\vec{d}\)</code><ol>
+<li><strong>Preprocessing</strong> (via <code>seq2sparse</code>): TF-IDF
transformation and L2 length normalization of <code>\(\vec{d}\)</code><ol>
<li><code>\(d_{ij} = \sqrt{d_{ij}}\)</code> </li>
<li><code>\(d_{ij} =
d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)</code> </li>
<li><code>\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)</code> </li>
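The three preprocessing transforms above can be sketched in NumPy. This is an
illustrative rendering of the formulas only, not Mahout's Java/MapReduce
implementation; the matrix <code>D</code> (words as rows, documents as columns)
and the helper name <code>preprocess</code> are hypothetical.

```python
import numpy as np

def preprocess(D):
    """TF-IDF transform and L2 length normalization of a word-by-document
    count matrix D, following the three steps above (illustrative sketch)."""
    D = np.asarray(D, dtype=float)
    n_docs = D.shape[1]
    # 1. Dampen raw term frequencies: d_ij = sqrt(d_ij)
    D = np.sqrt(D)
    # 2. IDF: scale each word's row by log(n / (df_i + 1)) + 1,
    #    where df_i is the number of documents containing word i
    df = np.count_nonzero(D, axis=1)
    D = D * (np.log(n_docs / (df + 1)) + 1)[:, np.newaxis]
    # 3. L2-normalize each document (column) to unit length
    norms = np.sqrt((D ** 2).sum(axis=0))
    return D / norms

# Toy corpus: 3 words x 2 documents
counts = [[2, 0],
          [1, 1],
          [0, 4]]
X = preprocess(counts)
```

After step 3 every document column has unit L2 norm, which is what makes the
subsequent per-term weights comparable across documents of different lengths.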
@@ -281,7 +281,7 @@
</li>
</ul>
<p>As we can see, the main difference between Bayes and CBayes is the weight
calculation step. Where Bayes weighs a term by the likelihood that it belongs
to class <code>\(c\)</code>, CBayes derives a term's weight for class
<code>\(c\)</code> from its occurrence in all classes <em>other than</em>
<code>\(c\)</code>, which makes the weight estimates much less sensitive to
skewed class sizes.</p>
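To make the contrast concrete, here is a minimal NumPy sketch of the two
weighting schemes. It illustrates the idea only, not Mahout's implementation:
the helper names are hypothetical, and TF-IDF preprocessing and the weight
normalization of CBayes are omitted for brevity.

```python
import numpy as np

def bayes_weights(D, labels, alpha_i=1.0):
    """Bayes: weight for term i in class c from its counts *in* class c."""
    D = np.asarray(D, dtype=float)
    classes = sorted(set(labels))
    W = np.empty((len(classes), D.shape[0]))
    for r, c in enumerate(classes):
        in_c = D[:, [j for j, y in enumerate(labels) if y == c]].sum(axis=1)
        # smoothed log-likelihood of term i given class c
        W[r] = np.log((in_c + alpha_i) / (in_c.sum() + alpha_i * D.shape[0]))
    return W

def cbayes_weights(D, labels, alpha_i=1.0):
    """CBayes: weight from term counts in every class *except* c."""
    D = np.asarray(D, dtype=float)
    classes = sorted(set(labels))
    W = np.empty((len(classes), D.shape[0]))
    for r, c in enumerate(classes):
        not_c = D[:, [j for j, y in enumerate(labels) if y != c]].sum(axis=1)
        # negated log-likelihood of the complement, so argmax still applies
        W[r] = -np.log((not_c + alpha_i) / (not_c.sum() + alpha_i * D.shape[0]))
    return W

# Toy data: 3 words x 4 documents, one label per document
D = [[3, 2, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 4, 3]]
labels = [0, 0, 1, 1]
doc = np.array([2.0, 1.0, 0.0])  # term counts of a new document
pred_b = int(np.argmax(bayes_weights(D, labels) @ doc))
pred_c = int(np.argmax(cbayes_weights(D, labels) @ doc))
```

On this toy data both schemes agree; the complement estimates of CBayes pay
off when class sizes are heavily skewed, since every class's weights are
estimated from roughly the same amount of data.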
-<h3 id="running-from-the-command-line">Running from the command line</h3>
+<h2 id="running-from-the-command-line">Running from the command line</h2>
<p>Mahout provides CLI drivers for all of the above steps. Here we give a
simple overview of Mahout CLI commands used to preprocess the data, train the
model and assign labels to the training set. An <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">example
script</a> is given for the full process from data acquisition through
classification of the classic <a
href="https://mahout.apache.org/users/classification/twenty-newsgroups.html">20
Newsgroups corpus</a>. </p>
<ul>
<li>
@@ -327,7 +327,7 @@ Classification and testing on a holdout
</li>
</ul>
-<h3 id="command-line-options">Command line options</h3>
+<h2 id="command-line-options">Command line options</h2>
<ul>
<li><strong>Preprocessing:</strong></li>
</ul>
@@ -393,12 +393,12 @@ Classification and testing on a holdout
</li>
</ul>
-<h3 id="examples">Examples</h3>
+<h2 id="examples">Examples</h2>
<p>Mahout provides an example for Naive Bayes classification:</p>
<ol>
<li><a href="twenty-newsgroups.html">Classify 20 Newsgroups</a></li>
</ol>
-<h3 id="references">References</h3>
+<h2 id="references">References</h2>
<p>[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003).
<a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">Tackling the
Poor Assumptions of Naive Bayes Text Classifiers</a>. Proceedings of the
Twentieth International Conference on Machine Learning (ICML-2003).</p>
</div>
</div>