Author: buildbot
Date: Wed Jul 8 19:34:09 2015
New Revision: 957555
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Jul 8 19:34:09 2015
@@ -1 +1 @@
-1688875
+1689946
Modified:
websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html
(original)
+++
websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html
Wed Jul 8 19:34:09 2015
@@ -263,13 +263,24 @@
<div id="content-wrap" class="clearfix">
<div id="main">
- <h1 id="spark-naive-bayes">Spark Naive Bayes</h1>
-<h2 id="intro">Intro</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="spark-naive-bayes">Spark Naive Bayes<a class="headerlink"
href="#spark-naive-bayes" title="Permanent link">¶</a></h1>
+<h2 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent
link">¶</a></h2>
<p>Mahout currently has two flavors of Naive Bayes. The first is standard
Multinomial Naive Bayes. The second is an implementation of Transformed
Weight-normalized Complement Naive Bayes as introduced by Rennie et al. <a
href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a>. We
refer to the former as Bayes and the latter as CBayes.</p>
<p>Where Bayes has long been a standard in text classification, CBayes is an
extension of Bayes that performs particularly well on datasets with skewed
classes and has been shown to be competitive with algorithms of higher
complexity such as Support Vector Machines. </p>
-<h2 id="implementations">Implementations</h2>
+<h2 id="implementations">Implementations<a class="headerlink"
href="#implementations" title="Permanent link">¶</a></h2>
<p>The Mahout <code>math-scala</code> library has an implementation of both
Bayes and CBayes, which is further optimized in the <code>spark</code> module.
Currently the Spark-optimized version provides CLI drivers for training and
testing. Mahout Spark-Naive-Bayes models can also be trained, tested and saved
to the filesystem from the Mahout Spark Shell. </p>
-<h2 id="preprocessing-and-algorithm">Preprocessing and Algorithm</h2>
+<h2 id="preprocessing-and-algorithm">Preprocessing and Algorithm<a
class="headerlink" href="#preprocessing-and-algorithm" title="Permanent
link">¶</a></h2>
<p>As described in <a
href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a> Mahout
Naive Bayes is broken down into the following steps (assignments are over all
possible index values): </p>
<ul>
<li>Let <code>\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)</code> be a set of
documents; <code>\(d_{ij}\)</code> is the count of word <code>\(i\)</code> in
document <code>\(j\)</code>.</li>
@@ -299,7 +310,7 @@
</li>
</ul>
<p>As we can see, the main difference between Bayes and CBayes is the weight
calculation step. Where Bayes weighs terms more heavily based on the
likelihood that they belong to class <code>\(c\)</code>, CBayes weighs terms
more heavily based on the likelihood that they do not belong to any other
class. </p>
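<p>The difference in the weight calculation step can be sketched in a few lines
of plain Python. This is a minimal illustration and not Mahout's implementation:
the function names, the per-class count matrix and the smoothing value
<code>alpha=1.0</code> are assumptions made for this example, and the TF-IDF
transformation and weight normalization steps are left out.</p>

```python
import math

def bayes_weights(counts, alpha=1.0):
    """Log of the smoothed probability of each term within its own class.

    counts[c][i] is the summed count of term i over documents labelled c.
    alpha is an assumed smoothing constant.
    """
    n_terms = len(counts[0])
    weights = []
    for row in counts:
        total = sum(row) + alpha * n_terms
        weights.append([math.log((x + alpha) / total) for x in row])
    return weights

def cbayes_weights(counts, alpha=1.0):
    """Negative log of the smoothed probability of each term in all OTHER classes."""
    n_terms = len(counts[0])
    term_totals = [sum(row[i] for row in counts) for i in range(n_terms)]
    grand_total = sum(term_totals)
    weights = []
    for row in counts:
        # Counts of everything that is not in this class (the complement).
        comp_total = grand_total - sum(row) + alpha * n_terms
        weights.append([
            -math.log((term_totals[i] - row[i] + alpha) / comp_total)
            for i in range(n_terms)
        ])
    return weights

def classify(weights, doc):
    """Return the index of the class with the highest weighted term sum."""
    scores = [sum(w * d for w, d in zip(ws, doc)) for ws in weights]
    return max(range(len(scores)), key=scores.__getitem__)

# Two classes, three terms; rows are per-class term counts.
counts = [[8, 1, 1],
          [1, 8, 1]]
print(classify(bayes_weights(counts), [3, 0, 0]))
print(classify(cbayes_weights(counts), [0, 2, 0]))
```

<p>In both variants the predicted label is the class with the largest weighted
sum of the document's term counts. Only the way the weights are derived from
the training counts changes.</p>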
-<h2 id="running-from-the-command-line">Running from the command line</h2>
+<h2 id="running-from-the-command-line">Running from the command line<a
class="headerlink" href="#running-from-the-command-line" title="Permanent
link">¶</a></h2>
<p>Mahout provides CLI drivers for all of the above steps. Here we will give a
simple overview of Mahout CLI commands used to preprocess the data, train the
model and assign labels to the training set. An <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">example
script</a> is given for the full process from data acquisition through
classification of the classic <a
href="https://mahout.apache.org/users/classification/twenty-newsgroups.html">20
Newsgroups corpus</a>. </p>
<ul>
<li>
@@ -333,14 +344,13 @@ Classification and testing on a holdout
<div class="codehilite"><pre>$ mahout spark-testnb
-i <span class="cp">${</span><span
class="n">PATH_TO_TFIDF_TEST_VECTORS</span><span class="cp">}</span>
-m <span class="cp">${</span><span class="n">PATH_TO_MODEL</span><span
class="cp">}</span>
- -ow
-c
</pre></div>
</li>
</ul>
-<h2 id="command-line-options">Command line options</h2>
+<h2 id="command-line-options">Command line options<a class="headerlink"
href="#command-line-options" title="Permanent link">¶</a></h2>
<ul>
<li><strong>Preprocessing:</strong> <em>note: still reliant on MapReduce
seq2sparse</em> </li>
</ul>
@@ -395,12 +405,12 @@ Classification and testing on a holdout
</li>
</ul>
-<h2 id="examples">Examples</h2>
+<h2 id="examples">Examples<a class="headerlink" href="#examples"
title="Permanent link">¶</a></h2>
<ol>
<li><a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">20
Newsgroups classification</a></li>
<li><a
href="https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala">Document
classification with Naive Bayes in the Mahout shell</a></li>
</ol>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references"
title="Permanent link">¶</a></h2>
<p>[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003).
<a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">Tackling the
Poor Assumptions of Naive Bayes Text Classifiers</a>. Proceedings of the
Twentieth International Conference on Machine Learning (ICML-2003).</p>
</div>
</div>