classify-a-doc-from-the-shell.html

buildbot Wed, 22 Apr 2015 18:29:37 -0700

Author: buildbot
Date: Thu Apr 23 01:28:51 2015
New Revision: 948826

Log:
Staging update by buildbot for mahout


Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Apr 23 01:28:51 2015
@@ -1 +1 @@
-1675527
+1675528

Modified: 
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
 Thu Apr 23 01:28:51 2015
@@ -262,9 +262,11 @@
   <div id="content-wrap" class="clearfix">
    <div id="main">
     <h1 id="classifying-a-document-with-the-mahout-shell">Classifying a 
Document with the Mahout Shell</h1>
-<p>This tutorial assumes that you have Spark configured for the 
<code>spark-shell</code> See <a 
href="http://mahout.apache.org/users/sparkbindings/play-with-shell.html">Playing
 with Mahout's Shell</a>.  As well we assume that Mahout is running in cluster 
mode (i.e. with the <code>MAHOUT_LOCAL</code> environment variable unset) so 
that the output is put into HDFS.</p>
-<h2 id="downloading-and-vectorizing-the-wikipedia-dataset">Downloading and 
Vectorizing the wikipedia dataset</h2>
-<p><em>As of Mahout v0.10.0, we are still reliant on the MapReduce versions of 
<code>mahout seqwiki</code> and <code>mahout seq2sparse</code> to extract and 
vectorize our text.  A</em> <a 
href="https://issues.apache.org/jira/browse/MAHOUT-1663"><em>Spark 
implemenation of seq2sparse</em></a> <em>is in the works for Mahout v0.11.</em> 
However, to download the wikipedia dataset, extract the bodies of the 
documentation, label each document and vectorize the text into TF-IDF vectors, 
we can sipmly run the <a 
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">wikipedia-classifier.sh</a>
 example.  </p>
+<p>This tutorial will take you through the steps used to train and create a 
Multinomial Naive Bayes text classifier using the <code>mahout 
spark-shell</code>. </p>
+<h2 id="prerequisites">Prerequisites</h2>
+<p>This tutorial assumes that you have your Spark environment variables set 
for the <code>mahout spark-shell</code> (see <a 
href="http://mahout.apache.org/users/sparkbindings/play-with-shell.html">Playing
 with Mahout's Shell</a>).  We also assume that Mahout is running in cluster 
mode (i.e. with the <code>MAHOUT_LOCAL</code> environment variable 
<strong>unset</strong>), as we'll be reading from and writing to HDFS.</p>
+<h2 id="downloading-and-vectorizing-the-wikipedia-dataset">Downloading and 
Vectorizing the Wikipedia dataset</h2>
+<p><em>As of Mahout v. 0.10.0, we are still reliant on the MapReduce versions 
of <code>mahout seqwiki</code> and <code>mahout seq2sparse</code> to extract 
and vectorize our text.  A</em> <a 
href="https://issues.apache.org/jira/browse/MAHOUT-1663"><em>Spark 
implementation of seq2sparse</em></a> <em>is in the works for Mahout v. 
0.11.</em> However, to download the Wikipedia dataset, extract the bodies of 
the documents, label each document and vectorize the text into TF-IDF 
vectors, we can simply run the <a 
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">classify-wikipedia.sh</a>
 example.  </p>
 <div class="codehilite"><pre><span class="n">Please</span> <span 
class="n">select</span> <span class="n">a</span> <span class="n">number</span> 
<span class="n">to</span> <span class="n">choose</span> <span 
class="n">the</span> <span class="n">corresponding</span> <span 
class="n">task</span> <span class="n">to</span> <span class="n">run</span>
 1<span class="p">.</span> <span class="n">CBayes</span> <span 
class="p">(</span><span class="n">may</span> <span class="n">require</span> 
<span class="n">increased</span> <span class="n">heap</span> <span 
class="n">space</span> <span class="n">on</span> <span 
class="n">yarn</span><span class="p">)</span>
 2<span class="p">.</span> <span class="n">BinaryCBayes</span>
@@ -273,16 +275,16 @@
 </pre></div>
 
 
-<p>Enter (2). This will download a large recent XML dump of the wikipedia 
database, into a <code>/tmp/mahout-work-wiki</code> directory, unzip it and  
place it into HDFS.  It will run a <a 
href="http://mahout.apache.org/users/classification/wikipedia-classifier-example.html">MapReduce
 job to parse the wikipedia set</a>, extracting and labeling only pages with 
category tags for [United States] and [United Kingdom]. It will then run 
<code>mahout seq2sparse</code> to convert the documents into TF-IDF vectors.  
The script will also a build and test a <a 
href="http://mahout.apache.org/users/classification/bayesian.html">Naive Bayes 
model using MapReduce</a>.  When it is completed, you should see a confusion 
matrix on your screen.  For this tutorial, we will ignore the MapReduce model, 
and build a new model using Spark based on the vectorization data created by 
<code>seq2sparse</code>.</p>
+<p>Enter (2). This will download a large recent XML dump of the Wikipedia 
database into a <code>/tmp/mahout-work-wiki</code> directory, unzip it and 
place it into HDFS.  It will run a <a 
href="http://mahout.apache.org/users/classification/wikipedia-classifier-example.html">MapReduce
 job to parse the Wikipedia set</a>, extracting and labeling only pages with 
category tags for [United States] and [United Kingdom] (~11600 documents). It 
will then run <code>mahout seq2sparse</code> to convert the documents into 
TF-IDF vectors.  The script will also build and test a <a 
href="http://mahout.apache.org/users/classification/bayesian.html">Naive Bayes 
model using MapReduce</a>.  When it is completed, you should see a confusion 
matrix on your screen.  For this tutorial, we will ignore the MapReduce model 
and build a new model using Spark based on the vectorized text output by 
<code>seq2sparse</code>.</p>
 <h2 id="getting-started">Getting Started</h2>
-<p>Launch the <code>mahout-shell</code>.  There is an example script: 
<code>spark-document-classifier.mscala</code> (<code>.mscala</code> denotes a 
Mahout-Scala script which can be run similarly to an R-script).   We will be 
walking through this script for this tutorial but if you wanted to simply run 
the script, you could just issue the command: </p>
+<p>Launch the <code>mahout spark-shell</code>.  There is an example script: 
<code>spark-document-classifier.mscala</code> (.mscala denotes a Mahout-Scala 
script which can be run similarly to an R script).   We will walk through 
this script in this tutorial, but if you simply want to run the script, you 
can issue the command: </p>
 <div class="codehilite"><pre><span class="n">mahout</span><span 
class="o">&gt;</span> <span class="p">:</span><span class="n">load</span> <span 
class="o">/</span><span class="n">path</span><span class="o">/</span><span 
class="n">to</span><span class="o">/</span><span class="n">mahout</span><span 
class="o">/</span><span class="n">examples</span><span class="o">/</span><span 
class="n">bin</span><span class="o">/</span><span class="n">spark</span><span 
class="o">-</span><span class="n">document</span><span class="o">-</span><span 
class="n">classifier</span><span class="p">.</span><span class="n">mscala</span>
 </pre></div>
 
 
-<p>For now, lets take the script apart piece by piece.</p>
+<p>For now, let's take the script apart piece by piece.  You can cut and paste 
the following code blocks into the <code>mahout spark-shell</code>.</p>
 <h2 id="imports">Imports</h2>
-<p>Our mahout Naive Bayes Imports:</p>
+<p>Our Mahout Naive Bayes Imports:</p>
 <div class="codehilite"><pre><span class="n">import</span> <span 
class="n">org</span><span class="p">.</span><span class="n">apache</span><span 
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span 
class="n">classifier</span><span class="p">.</span><span 
class="n">naivebayes</span><span class="p">.</span><span class="n">_</span>
 <span class="n">import</span> <span class="n">org</span><span 
class="p">.</span><span class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span><span class="p">.</span><span 
class="n">classifier</span><span class="p">.</span><span 
class="n">stats</span><span class="p">.</span><span class="n">_</span>
 <span class="n">import</span> <span class="n">org</span><span 
class="p">.</span><span class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span><span class="p">.</span><span class="n">nlp</span><span 
class="p">.</span><span class="n">tfidf</span><span class="p">.</span><span 
class="n">_</span>
@@ -296,26 +298,27 @@
 </pre></div>
 
 
-<h2 
id="read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash">read
 in our full set from HDFS as vectorized by seq2sparse in 
classify-wikipedia.sh</h2>
+<h2 
id="read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash">Read
 in our full set from HDFS as vectorized by seq2sparse in 
classify-wikipedia.sh</h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">pathToData</span> <span class="p">=</span> &quot;<span 
class="o">/</span><span class="n">tmp</span><span class="o">/</span><span 
class="n">mahout</span><span class="o">-</span><span class="n">work</span><span 
class="o">-</span><span class="n">wiki</span><span class="o">/</span>&quot;
 <span class="n">val</span> <span class="n">fullData</span> <span 
class="p">=</span> <span class="n">drmDfsRead</span><span 
class="p">(</span><span class="n">pathToData</span> <span class="o">+</span> 
&quot;<span class="n">wikipediaVecs</span><span class="o">/</span><span 
class="n">tfidf</span><span class="o">-</span><span 
class="n">vectors</span>&quot;<span class="p">)</span>
 </pre></div>
 
 
-<h2 
id="extract-the-category-of-each-observation-and-aggregate-those-observation-by-category">extract
 the category of each observation and aggregate those observation by 
category</h2>
-<div class="codehilite"><pre><span class="n">val</span> <span 
class="p">(</span><span class="n">labelIndex</span><span class="p">,</span> 
<span class="n">aggregatedObservations</span><span class="p">)</span> <span 
class="p">=</span> <span class="n">SparkNaiveBayes</span><span 
class="p">.</span><span 
class="n">extractLabelsAndAggregateObservations</span><span 
class="p">(</span><span class="n">fullData</span><span class="p">)</span>
+<h2 
id="extract-the-category-of-each-observation-and-aggregate-those-observation-by-category">Extract
 the category of each observation and aggregate those observations by 
category</h2>
+<div class="codehilite"><pre><span class="n">val</span> <span 
class="p">(</span><span class="n">labelIndex</span><span class="p">,</span> 
<span class="n">aggregatedObservations</span><span class="p">)</span> <span 
class="p">=</span> <span class="n">SparkNaiveBayes</span><span 
class="p">.</span><span 
class="n">extractLabelsAndAggregateObservations</span><span class="p">(</span>
+                                                             <span 
class="n">fullData</span><span class="p">)</span>
 </pre></div>
 
 
-<h2 
id="build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set">build
 a Muitinomial Naive Bayes model and self test on the training set</h2>
+<h2 
id="build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set">Build
 a Multinomial Naive Bayes model and self-test on the training set</h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">model</span> <span class="p">=</span> <span 
class="n">SparkNaiveBayes</span><span class="p">.</span><span 
class="n">train</span><span class="p">(</span><span 
class="n">aggregatedObservations</span><span class="p">,</span> <span 
class="n">labelIndex</span><span class="p">,</span> <span 
class="n">false</span><span class="p">)</span>
 <span class="n">val</span> <span class="n">resAnalyzer</span> <span 
class="p">=</span> <span class="n">SparkNaiveBayes</span><span 
class="p">.</span><span class="n">test</span><span class="p">(</span><span 
class="n">model</span><span class="p">,</span> <span 
class="n">fullData</span><span class="p">,</span> <span 
class="n">false</span><span class="p">)</span>
 <span class="n">println</span><span class="p">(</span><span 
class="n">resAnalyzer</span><span class="p">)</span>
 </pre></div>
 
 
-<p>printing the result analyzer will display the confusion matrix</p>
-<h2 id="read-in-the-dictionary-and-document-frequency-count-from-hdfs">read in 
the dictionary and document frequency count from HDFS</h2>
+<p>Printing the <code>ResultAnalyzer</code> will display the confusion 
matrix.</p>
+<h2 id="read-in-the-dictionary-and-document-frequency-count-from-hdfs">Read in 
the dictionary and document frequency count from HDFS</h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">dictionary</span> <span class="p">=</span> <span 
class="n">sdc</span><span class="p">.</span><span 
class="n">sequenceFile</span><span class="p">(</span><span 
class="n">pathToData</span> <span class="o">+</span> &quot;<span 
class="n">wikipediaVecs</span><span class="o">/</span><span 
class="n">dictionary</span><span class="p">.</span><span 
class="n">file</span><span class="o">-</span>0&quot;<span class="p">,</span>
                                   <span class="n">classOf</span><span 
class="p">[</span><span class="n">Text</span><span class="p">],</span>
                                   <span class="n">classOf</span><span 
class="p">[</span><span class="n">IntWritable</span><span class="p">])</span>
@@ -339,8 +342,8 @@
 </pre></div>
 
 
-<h2 
id="define-a-function-to-tokeinze-and-vectorize-new-text-using-our-current-dictionary">define
 a function to tokeinze and vectorize new text using our current dictionary</h2>
-<p>For this simple example, our function ```vectorizeDocument(...) will 
tokenize a new document into unigrams using native Java String methods and 
vectorize usingour dictionary and document frequencies. You could also use a <a 
href="https://lucene.apache.org/core/">Lucene</a> analyzer for bigrams, 
trigrams, etc., and integrate Apache <a 
href="https://tika.apache.org/">Tika</a> to extract text from different 
document types (PDF, PPT, XLS, etc.).  Here, however we will kwwp ot simple and 
split ouor text using regexs and native String methods.</p>
+<h2 
id="define-a-function-to-tokenize-and-vectorize-new-text-using-our-current-dictionary">Define
 a function to tokenize and vectorize new text using our current dictionary</h2>
+<p>For this simple example, our function <code>vectorizeDocument(...)</code> will 
tokenize a new document into unigrams using native Java String methods and 
vectorize using our dictionary and document frequencies. You could also use a 
<a href="https://lucene.apache.org/core/">Lucene</a> analyzer for bigrams, 
trigrams, etc., and integrate Apache <a 
href="https://tika.apache.org/">Tika</a> to extract text from different 
document types (PDF, PPT, XLS, etc.).  Here, however, we will keep it simple and 
split our text using regexes and native String methods.</p>
 <div class="codehilite"><pre>def vectorizeDocument<span 
class="p">(</span>document: String<span class="p">,</span>
                         dictionaryMap: Map<span class="p">[</span>String<span 
class="p">,</span>Int<span class="p">],</span>
                         dfMap: Map<span class="p">[</span>Int<span 
class="p">,</span>Long<span class="p">])</span>: Vector <span 
class="o">=</span> <span class="p">{</span>
@@ -371,7 +374,7 @@
 </pre></div>
 
 
-<h2 id="setup-our-classifier">setup our classifier</h2>
+<h2 id="setup-our-classifier">Set up our classifier</h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">labelMap</span> <span class="p">=</span> <span 
class="n">model</span><span class="p">.</span><span class="n">labelIndex</span>
 <span class="n">val</span> <span class="n">numLabels</span> <span 
class="p">=</span> <span class="n">model</span><span class="p">.</span><span 
class="n">numLabels</span>
 <span class="n">val</span> <span class="n">reverseLabelMap</span> <span 
class="p">=</span> <span class="n">labelMap</span><span class="p">.</span><span 
class="n">map</span><span class="p">(</span><span class="n">x</span> <span 
class="p">=</span><span class="o">&gt;</span> <span class="n">x</span><span 
class="p">.</span><span class="n">_2</span> <span class="o">-&gt;</span> <span 
class="n">x</span><span class="p">.</span><span class="n">_1</span><span 
class="p">)</span>
@@ -384,8 +387,8 @@
 </pre></div>
 
 
-<h2 id="define-an-argmax-function">define an argmax function</h2>
-<p>The label with the higest score wins the classification for a given 
document</p>
+<h2 id="define-an-argmax-function">Define an argmax function</h2>
+<p>The label with the highest score wins the classification for a given 
document.</p>
 <div class="codehilite"><pre>def argmax<span class="p">(</span>v: Vector<span 
class="p">)</span>: <span class="p">(</span>Int<span class="p">,</span> 
Double<span class="p">)</span> <span class="o">=</span> <span class="p">{</span>
     var bestIdx: Int <span class="o">=</span> Integer.MIN_VALUE
     var bestScore: Double <span class="o">=</span> 
Integer.MIN_VALUE.asInstanceOf<span class="p">[</span>Int<span 
class="p">]</span><span class="m">.</span>toDouble
@@ -400,7 +403,7 @@
 </pre></div>
 
 
-<h2 id="define-our-final-tf-idf-vector-classifier">define our final TF(-IDF) 
vector classifier</h2>
+<h2 id="define-our-final-tf-idf-vector-classifier">Define our final TF(-IDF) 
vector classifier</h2>
 <div class="codehilite"><pre><span class="n">def</span> <span 
class="n">classifyDocument</span><span class="p">(</span><span 
class="n">clvec</span><span class="p">:</span> <span 
class="n">Vector</span><span class="p">)</span> <span class="p">:</span> <span 
class="n">String</span> <span class="p">=</span> <span class="p">{</span>
     <span class="n">val</span> <span class="n">cvec</span> <span 
class="p">=</span> <span class="n">classifier</span><span 
class="p">.</span><span class="n">classifyFull</span><span 
class="p">(</span><span class="n">clvec</span><span class="p">)</span>
     <span class="n">val</span> <span class="p">(</span><span 
class="n">bestIdx</span><span class="p">,</span> <span 
class="n">bestScore</span><span class="p">)</span> <span class="p">=</span> 
<span class="n">argmax</span><span class="p">(</span><span 
class="n">cvec</span><span class="p">)</span>
@@ -478,7 +481,7 @@
 </pre></div>
 
 
-<h2 id="vectorize-and-classify-our-documents">vectorize and classify our 
documents</h2>
+<h2 id="vectorize-and-classify-our-documents">Vectorize and classify our 
documents</h2>
 <div class="codehilite"><pre><span class="n">val</span> <span 
class="n">usVec</span> <span class="p">=</span> <span 
class="n">vectorizeDocument</span><span class="p">(</span><span 
class="n">UStextToClassify</span><span class="p">,</span> <span 
class="n">dictionaryMap</span><span class="p">,</span> <span 
class="n">dfCountMap</span><span class="p">)</span>
 <span class="n">val</span> <span class="n">ukVec</span> <span 
class="p">=</span> <span class="n">vectorizeDocument</span><span 
class="p">(</span><span class="n">UKtextToClassify</span><span 
class="p">,</span> <span class="n">dictionaryMap</span><span class="p">,</span> 
<span class="n">dfCountMap</span><span class="p">)</span>
 
@@ -490,7 +493,7 @@
 </pre></div>
 
 
-<h2 id="tie-everything-together-in-a-new-method-to-classify-new-text">tie 
everything together in a new method to classify new text</h2>
+<h2 id="tie-everything-together-in-a-new-method-to-classify-new-text">Tie 
everything together in a new method to classify new text</h2>
 <div class="codehilite"><pre><span class="n">def</span> <span 
class="n">classifyText</span><span class="p">(</span><span 
class="n">txt</span><span class="p">:</span> <span class="n">String</span><span 
class="p">):</span> <span class="n">String</span> <span class="p">=</span> 
<span class="p">{</span>
     <span class="n">val</span> <span class="n">v</span> <span 
class="p">=</span> <span class="n">vectorizeDocument</span><span 
class="p">(</span><span class="n">txt</span><span class="p">,</span> <span 
class="n">dictionaryMap</span><span class="p">,</span> <span 
class="n">dfCountMap</span><span class="p">)</span>
     <span class="n">classifyDocument</span><span class="p">(</span><span 
class="n">v</span><span class="p">)</span>
@@ -499,7 +502,7 @@
 </pre></div>
 
 
-<h2 id="now-we-can-simply-call-our-classifytext-method-on-any-string">now we 
can simply call our classifyText method on any string</h2>
+<h2 id="now-we-can-simply-call-our-classifytext-method-on-any-string">Now we 
can simply call our classifyText(...) method on any string</h2>
 <div class="codehilite"><pre><span class="n">classifyText</span><span 
class="p">(</span>&quot;<span class="n">Hello</span> <span 
class="n">world</span> <span class="n">from</span> <span 
class="n">Queens</span>&quot;<span class="p">)</span>
 <span class="n">classifyText</span><span class="p">(</span>&quot;<span 
class="n">Hello</span> <span class="n">world</span> <span class="n">from</span> 
<span class="n">London</span>&quot;<span class="p">)</span>
 </pre></div>
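
Note on the hunks above: the diff elides the bodies of vectorizeDocument(...) and argmax(...). As a reading aid only, the following is a minimal sketch of the two ideas the page describes (unigram tokenization with TF-IDF weighting, and picking the highest-scoring label). It is not the code from spark-document-classifier.mscala. The object name ClassifySketch, the helper names, the use of plain Scala Maps and Seqs instead of Mahout vectors, and the tf * ln(numDocs / df) weighting are all assumptions made for illustration; Mahout's own weighting may differ.

```scala
// Hypothetical stand-in: plain Scala collections, not Mahout's Vector API.
object ClassifySketch {

  // Split text into lowercase unigrams on runs of non-letter characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Build sparse TF-IDF weights keyed by dictionary index.
  // Assumed weighting: tf * ln(numDocs / df). Terms missing from the
  // dictionary, or with no recorded document frequency, are dropped.
  def vectorize(text: String,
                dictionary: Map[String, Int],
                docFreq: Map[Int, Long],
                numDocs: Long): Map[Int, Double] = {
    // Count how often each known term occurs in this document.
    val termCounts: Map[Int, Int] = tokenize(text)
      .filter(dictionary.contains)
      .groupBy(identity)
      .map { case (term, hits) => dictionary(term) -> hits.size }

    termCounts.collect {
      case (idx, tf) if docFreq.getOrElse(idx, 0L) > 0 =>
        idx -> tf * math.log(numDocs.toDouble / docFreq(idx))
    }
  }

  // Return (index, score) of the largest score; throws on an empty input.
  def argmax(scores: Seq[Double]): (Int, Double) = {
    val (best, idx) = scores.zipWithIndex.maxBy(_._1)
    (idx, best)
  }
}
```

With a toy dictionary such as Map("london" -> 0, "hello" -> 2), ClassifySketch.vectorize("Hello world from London", ...) keeps only the two known terms, and ClassifySketch.argmax applied to a sequence of per-label scores yields the position of the winning label, which the tutorial then maps back to a label name through its reverse label map.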

svn commit: r948826 - in /websites/staging/mahout/trunk/content: ./ users/environment/classify-a-doc-from-the-shell.html
