spark-naive-bayes.html

buildbot Wed, 08 Jul 2015 12:34:56 -0700

Author: buildbot
Date: Wed Jul  8 19:34:09 2015
New Revision: 957555

Log:
Staging update by buildbot for mahout


Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Jul  8 19:34:09 2015
@@ -1 +1 @@
-1688875
+1689946

Modified: websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html (original)
+++ websites/staging/mahout/trunk/content/users/algorithms/spark-naive-bayes.html Wed Jul  8 19:34:09 2015
@@ -263,13 +263,24 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="spark-naive-bayes">Spark Naive Bayes</h1>
-<h2 id="intro">Intro</h2>
+    <style type="text/css">
+/* The following code is added by mdx_elementid.py
+   It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+  visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, 
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, 
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="spark-naive-bayes">Spark Naive Bayes<a class="headerlink" 
href="#spark-naive-bayes" title="Permanent link">&para;</a></h1>
+<h2 id="intro">Intro<a class="headerlink" href="#intro" title="Permanent 
link">&para;</a></h2>
 <p>Mahout currently has two flavors of Naive Bayes.  The first is standard 
Multinomial Naive Bayes. The second is an implementation of Transformed 
Weight-normalized Complement Naive Bayes as introduced by Rennie et al. <a 
href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a>. We 
refer to the former as Bayes and the latter as CBayes.</p>
 <p>While Bayes has long been a standard in text classification, CBayes is an 
extension of Bayes that performs particularly well on datasets with skewed 
classes and has been shown to be competitive with algorithms of higher 
complexity such as Support Vector Machines. </p>
-<h2 id="implementations">Implementations</h2>
+<h2 id="implementations">Implementations<a class="headerlink" 
href="#implementations" title="Permanent link">&para;</a></h2>
 <p>The Mahout <code>math-scala</code> library has an implementation of both 
Bayes and CBayes, which is further optimized in the <code>spark</code> module. 
Currently the Spark-optimized version provides CLI drivers for training and 
testing. Mahout Spark-Naive-Bayes models can also be trained, tested and saved 
to the filesystem from the Mahout Spark Shell. </p>
-<h2 id="preprocessing-and-algorithm">Preprocessing and Algorithm</h2>
+<h2 id="preprocessing-and-algorithm">Preprocessing and Algorithm<a 
class="headerlink" href="#preprocessing-and-algorithm" title="Permanent 
link">&para;</a></h2>
 <p>As described in <a 
href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a> Mahout 
Naive Bayes is broken down into the following steps (assignments are over all 
possible index values):  </p>
 <ul>
 <li>Let <code>\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)</code> be a set of 
documents; <code>\(d_{ij}\)</code> is the count of word <code>\(i\)</code> in 
document <code>\(j\)</code>.</li>
@@ -299,7 +310,7 @@
 </li>
 </ul>
 <p>As we can see, the main difference between Bayes and CBayes is the weight 
calculation step.  Where Bayes weighs terms more heavily based on the 
likelihood that they belong to class <code>\(c\)</code>, CBayes seeks to 
maximize term weights on the likelihood that they do not belong to any other 
class.  </p>
-<h2 id="running-from-the-command-line">Running from the command line</h2>
+<h2 id="running-from-the-command-line">Running from the command line<a 
class="headerlink" href="#running-from-the-command-line" title="Permanent 
link">&para;</a></h2>
 <p>Mahout provides CLI drivers for all of the above steps.  Here we give a 
simple overview of the Mahout CLI commands used to preprocess the data, train the 
model and assign labels to the training set. An <a 
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">example
 script</a> is given for the full process from data acquisition through 
classification of the classic <a 
href="https://mahout.apache.org/users/classification/twenty-newsgroups.html">20 
Newsgroups corpus</a>.  </p>
 <ul>
 <li>
@@ -333,14 +344,13 @@ Classification and testing on a holdout
 <div class="codehilite"><pre>$ mahout spark-testnb 
   -i <span class="cp">${</span><span 
class="n">PATH_TO_TFIDF_TEST_VECTORS</span><span class="cp">}</span>
   -m <span class="cp">${</span><span class="n">PATH_TO_MODEL</span><span 
class="cp">}</span> 
-  -ow 
   -c
 </pre></div>
 
 
 </li>
 </ul>
-<h2 id="command-line-options">Command line options</h2>
+<h2 id="command-line-options">Command line options<a class="headerlink" 
href="#command-line-options" title="Permanent link">&para;</a></h2>
 <ul>
 <li><strong>Preprocessing:</strong> <em>note: still reliant on MapReduce 
seq2sparse</em> </li>
 </ul>
@@ -395,12 +405,12 @@ Classification and testing on a holdout
 
 </li>
 </ul>
-<h2 id="examples">Examples</h2>
+<h2 id="examples">Examples<a class="headerlink" href="#examples" 
title="Permanent link">&para;</a></h2>
 <ol>
 <li><a 
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">20
 Newsgroups classification</a></li>
 <li><a 
href="https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala">Document
 classification with Naive Bayes in the Mahout shell</a></li>
 </ol>
-<h2 id="references">References</h2>
+<h2 id="references">References<a class="headerlink" href="#references" 
title="Permanent link">&para;</a></h2>
 <p>[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003). 
<a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">Tackling the 
Poor Assumptions of Naive Bayes Text Classifiers</a>. Proceedings of the 
Twentieth International Conference on Machine Learning (ICML-2003).</p>
    </div>
   </div>
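
Note on the page content above: the "Preprocessing and Algorithm" text says that Bayes weighs terms by the likelihood that they belong to class c, while CBayes weighs them by the likelihood that they do not belong to any other class. The following is a minimal sketch of that distinction only, not Mahout's implementation. It assumes raw term counts instead of TF-IDF vectors, add-alpha smoothing, and no weight normalization; the names train_weights and classify are hypothetical and do not come from Mahout.

```python
import math
from collections import defaultdict


def train_weights(docs, labels, alpha=1.0):
    """Return (bayes, cbayes): per-class term weights from raw counts.

    docs   -- list of {term: count} dictionaries (assumed input shape)
    labels -- list of class labels, one per document
    alpha  -- add-alpha smoothing constant (an assumption of this sketch)
    """
    vocab = sorted({t for d in docs for t in d})
    classes = sorted(set(labels))

    # Sum term counts per class.
    per_class = {c: defaultdict(float) for c in classes}
    for doc, label in zip(docs, labels):
        for term, count in doc.items():
            per_class[label][term] += count

    # Term counts over the whole corpus, used to derive complement counts.
    total = {t: sum(per_class[c][t] for c in classes) for t in vocab}
    grand_total = sum(total.values())

    bayes, cbayes = {}, {}
    for c in classes:
        n_c = sum(per_class[c].values())
        denom_in = n_c + alpha * len(vocab)
        denom_out = (grand_total - n_c) + alpha * len(vocab)
        # Bayes: log-likelihood of the term inside class c.
        bayes[c] = {
            t: math.log((per_class[c][t] + alpha) / denom_in) for t in vocab
        }
        # CBayes: negated log-likelihood of the term in every class but c,
        # so terms that are rare outside c receive a large weight for c.
        cbayes[c] = {
            t: -math.log((total[t] - per_class[c][t] + alpha) / denom_out)
            for t in vocab
        }
    return bayes, cbayes


def classify(weights, doc):
    """Pick the class with the largest count-weighted sum of term weights."""
    return max(
        weights,
        key=lambda c: sum(n * weights[c].get(t, 0.0) for t, n in doc.items()),
    )
```

Both weight tables are used the same way at classification time (largest weighted sum wins); only the counts that feed the weight differ, which is the point the page makes.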

svn commit: r957555 - in /websites/staging/mahout/trunk/content: ./ users/algorithms/spark-naive-bayes.html
