Author: buildbot
Date: Thu Apr 23 01:28:51 2015
New Revision: 948826
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Apr 23 01:28:51 2015
@@ -1 +1 @@
-1675527
+1675528
Modified:
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
(original)
+++
websites/staging/mahout/trunk/content/users/environment/classify-a-doc-from-the-shell.html
Thu Apr 23 01:28:51 2015
@@ -262,9 +262,11 @@
<div id="content-wrap" class="clearfix">
<div id="main">
<h1 id="classifying-a-document-with-the-mahout-shell">Classifying a
Document with the Mahout Shell</h1>
-<p>This tutorial assumes that you have Spark configured for the
<code>spark-shell</code> See <a
href="http://mahout.apache.org/users/sparkbindings/play-with-shell.html">Playing
with Mahout's Shell</a>. As well we assume that Mahout is running in cluster
mode (i.e. with the <code>MAHOUT_LOCAL</code> environment variable unset) so
that the output is put into HDFS.</p>
-<h2 id="downloading-and-vectorizing-the-wikipedia-dataset">Downloading and
Vectorizing the wikipedia dataset</h2>
-<p><em>As of Mahout v0.10.0, we are still reliant on the MapReduce versions of
<code>mahout seqwiki</code> and <code>mahout seq2sparse</code> to extract and
vectorize our text. A</em> <a
href="https://issues.apache.org/jira/browse/MAHOUT-1663"><em>Spark
implemenation of seq2sparse</em></a> <em>is in the works for Mahout v0.11.</em>
However, to download the wikipedia dataset, extract the bodies of the
documentation, label each document and vectorize the text into TF-IDF vectors,
we can sipmly run the <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">wikipedia-classifier.sh</a>
example. </p>
+<p>This tutorial walks through training a Multinomial Naive Bayes text
classifier and using it to classify new documents from the <code>mahout
spark-shell</code>.</p>
+<h2 id="prerequisites">Prerequisites</h2>
+<p>This tutorial assumes that you have your Spark environment variables set
for the <code>mahout spark-shell</code>; see <a
href="http://mahout.apache.org/users/sparkbindings/play-with-shell.html">Playing
with Mahout's Shell</a>. We also assume that Mahout is running in cluster
mode (i.e. with the <code>MAHOUT_LOCAL</code> environment variable
<strong>unset</strong>), as we'll be reading from and writing to HDFS.</p>
+<h2 id="downloading-and-vectorizing-the-wikipedia-dataset">Downloading and
Vectorizing the Wikipedia dataset</h2>
+<p><em>As of Mahout v. 0.10.0, we are still reliant on the MapReduce versions
of <code>mahout seqwiki</code> and <code>mahout seq2sparse</code> to extract
and vectorize our text. A</em> <a
href="https://issues.apache.org/jira/browse/MAHOUT-1663"><em>Spark
implementation of seq2sparse</em></a> <em>is in the works for Mahout v.
0.11.</em> However, to download the Wikipedia dataset, extract the bodies of
the documents, label each document and vectorize the text into TF-IDF
vectors, we can simply run the <a
href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">classify-wikipedia.sh</a>
example.</p>
<div class="codehilite"><pre><span class="n">Please</span> <span
class="n">select</span> <span class="n">a</span> <span class="n">number</span>
<span class="n">to</span> <span class="n">choose</span> <span
class="n">the</span> <span class="n">corresponding</span> <span
class="n">task</span> <span class="n">to</span> <span class="n">run</span>
1<span class="p">.</span> <span class="n">CBayes</span> <span
class="p">(</span><span class="n">may</span> <span class="n">require</span>
<span class="n">increased</span> <span class="n">heap</span> <span
class="n">space</span> <span class="n">on</span> <span
class="n">yarn</span><span class="p">)</span>
2<span class="p">.</span> <span class="n">BinaryCBayes</span>
@@ -273,16 +275,16 @@
</pre></div>
-<p>Enter (2). This will download a large recent XML dump of the wikipedia
database, into a <code>/tmp/mahout-work-wiki</code> directory, unzip it and
place it into HDFS. It will run a <a
href="http://mahout.apache.org/users/classification/wikipedia-classifier-example.html">MapReduce
job to parse the wikipedia set</a>, extracting and labeling only pages with
category tags for [United States] and [United Kingdom]. It will then run
<code>mahout seq2sparse</code> to convert the documents into TF-IDF vectors.
The script will also a build and test a <a
href="http://mahout.apache.org/users/classification/bayesian.html">Naive Bayes
model using MapReduce</a>. When it is completed, you should see a confusion
matrix on your screen. For this tutorial, we will ignore the MapReduce model,
and build a new model using Spark based on the vectorization data created by
<code>seq2sparse</code>.</p>
+<p>Enter (2). This will download a large, recent XML dump of the Wikipedia
database into a <code>/tmp/mahout-work-wiki</code> directory, unzip it and
place it into HDFS. It will run a <a
href="http://mahout.apache.org/users/classification/wikipedia-classifier-example.html">MapReduce
job to parse the Wikipedia set</a>, extracting and labeling only pages with
category tags for [United States] and [United Kingdom] (~11600 documents). It
will then run <code>mahout seq2sparse</code> to convert the documents into
TF-IDF vectors. The script will also build and test a <a
href="http://mahout.apache.org/users/classification/bayesian.html">Naive Bayes
model using MapReduce</a>. When it is completed, you should see a confusion
matrix on your screen. For this tutorial, we will ignore the MapReduce model,
and build a new model using Spark based on the vectorized text output by
<code>seq2sparse</code>.</p>
<h2 id="getting-started">Getting Started</h2>
-<p>Launch the <code>mahout-shell</code>. There is an example script:
<code>spark-document-classifier.mscala</code> (<code>.mscala</code> denotes a
Mahout-Scala script which can be run similarly to an R-script). We will be
walking through this script for this tutorial but if you wanted to simply run
the script, you could just issue the command: </p>
+<p>Launch the <code>mahout spark-shell</code>. There is an example script:
<code>spark-document-classifier.mscala</code> (<code>.mscala</code> denotes a
Mahout-Scala script, which can be run similarly to an R script). We will walk
through this script in this tutorial, but if you simply want to run the
script, you can issue the command:</p>
<div class="codehilite"><pre><span class="n">mahout</span><span
class="o">></span> <span class="p">:</span><span class="n">load</span> <span
class="o">/</span><span class="n">path</span><span class="o">/</span><span
class="n">to</span><span class="o">/</span><span class="n">mahout</span><span
class="o">/</span><span class="n">examples</span><span class="o">/</span><span
class="n">bin</span><span class="o">/</span><span class="n">spark</span><span
class="o">-</span><span class="n">document</span><span class="o">-</span><span
class="n">classifier</span><span class="p">.</span><span class="n">mscala</span>
</pre></div>
-<p>For now, lets take the script apart piece by piece.</p>
+<p>For now, let's take the script apart piece by piece. You can cut and paste
the following code blocks into the <code>mahout spark-shell</code>.</p>
<h2 id="imports">Imports</h2>
-<p>Our mahout Naive Bayes Imports:</p>
+<p>Our Mahout Naive Bayes imports:</p>
<div class="codehilite"><pre><span class="n">import</span> <span
class="n">org</span><span class="p">.</span><span class="n">apache</span><span
class="p">.</span><span class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">naivebayes</span><span class="p">.</span><span class="n">_</span>
<span class="n">import</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span
class="n">classifier</span><span class="p">.</span><span
class="n">stats</span><span class="p">.</span><span class="n">_</span>
<span class="n">import</span> <span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">mahout</span><span class="p">.</span><span class="n">nlp</span><span
class="p">.</span><span class="n">tfidf</span><span class="p">.</span><span
class="n">_</span>
@@ -296,26 +298,27 @@
</pre></div>
-<h2
id="read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash">read
in our full set from HDFS as vectorized by seq2sparse in
classify-wikipedia.sh</h2>
+<h2
id="read-in-our-full-set-from-hdfs-as-vectorized-by-seq2sparse-in-classify-wikipediash">Read
in our full set from HDFS as vectorized by seq2sparse in
classify-wikipedia.sh</h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">pathToData</span> <span class="p">=</span> "<span
class="o">/</span><span class="n">tmp</span><span class="o">/</span><span
class="n">mahout</span><span class="o">-</span><span class="n">work</span><span
class="o">-</span><span class="n">wiki</span><span class="o">/</span>"
<span class="n">val</span> <span class="n">fullData</span> <span
class="p">=</span> <span class="n">drmDfsRead</span><span
class="p">(</span><span class="n">pathToData</span> <span class="o">+</span>
"<span class="n">wikipediaVecs</span><span class="o">/</span><span
class="n">tfidf</span><span class="o">-</span><span
class="n">vectors</span>"<span class="p">)</span>
</pre></div>
-<h2
id="extract-the-category-of-each-observation-and-aggregate-those-observation-by-category">extract
the category of each observation and aggregate those observation by
category</h2>
-<div class="codehilite"><pre><span class="n">val</span> <span
class="p">(</span><span class="n">labelIndex</span><span class="p">,</span>
<span class="n">aggregatedObservations</span><span class="p">)</span> <span
class="p">=</span> <span class="n">SparkNaiveBayes</span><span
class="p">.</span><span
class="n">extractLabelsAndAggregateObservations</span><span
class="p">(</span><span class="n">fullData</span><span class="p">)</span>
+<h2
id="extract-the-category-of-each-observation-and-aggregate-those-observation-by-category">Extract
the category of each observation and aggregate those observations by
category</h2>
+<div class="codehilite"><pre><span class="n">val</span> <span
class="p">(</span><span class="n">labelIndex</span><span class="p">,</span>
<span class="n">aggregatedObservations</span><span class="p">)</span> <span
class="p">=</span> <span class="n">SparkNaiveBayes</span><span
class="p">.</span><span
class="n">extractLabelsAndAggregateObservations</span><span class="p">(</span>
+ <span
class="n">fullData</span><span class="p">)</span>
</pre></div>
-<h2
id="build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set">build
a Muitinomial Naive Bayes model and self test on the training set</h2>
+<h2
id="build-a-muitinomial-naive-bayes-model-and-self-test-on-the-training-set">Build
a Multinomial Naive Bayes model and self-test on the training set</h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">model</span> <span class="p">=</span> <span
class="n">SparkNaiveBayes</span><span class="p">.</span><span
class="n">train</span><span class="p">(</span><span
class="n">aggregatedObservations</span><span class="p">,</span> <span
class="n">labelIndex</span><span class="p">,</span> <span
class="n">false</span><span class="p">)</span>
<span class="n">val</span> <span class="n">resAnalyzer</span> <span
class="p">=</span> <span class="n">SparkNaiveBayes</span><span
class="p">.</span><span class="n">test</span><span class="p">(</span><span
class="n">model</span><span class="p">,</span> <span
class="n">fullData</span><span class="p">,</span> <span
class="n">false</span><span class="p">)</span>
<span class="n">println</span><span class="p">(</span><span
class="n">resAnalyzer</span><span class="p">)</span>
</pre></div>
-<p>printing the result analyzer will display the confusion matrix</p>
-<h2 id="read-in-the-dictionary-and-document-frequency-count-from-hdfs">read in
the dictionary and document frequency count from HDFS</h2>
+<p>Printing the <code>ResultAnalyzer</code> will display the confusion
matrix.</p>
+<h2 id="read-in-the-dictionary-and-document-frequency-count-from-hdfs">Read in
the dictionary and document frequency count from HDFS</h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">dictionary</span> <span class="p">=</span> <span
class="n">sdc</span><span class="p">.</span><span
class="n">sequenceFile</span><span class="p">(</span><span
class="n">pathToData</span> <span class="o">+</span> "<span
class="n">wikipediaVecs</span><span class="o">/</span><span
class="n">dictionary</span><span class="p">.</span><span
class="n">file</span><span class="o">-</span>0"<span class="p">,</span>
<span class="n">classOf</span><span
class="p">[</span><span class="n">Text</span><span class="p">],</span>
<span class="n">classOf</span><span
class="p">[</span><span class="n">IntWritable</span><span class="p">])</span>
@@ -339,8 +342,8 @@
</pre></div>
-<h2
id="define-a-function-to-tokeinze-and-vectorize-new-text-using-our-current-dictionary">define
a function to tokeinze and vectorize new text using our current dictionary</h2>
-<p>For this simple example, our function ```vectorizeDocument(...) will
tokenize a new document into unigrams using native Java String methods and
vectorize usingour dictionary and document frequencies. You could also use a <a
href="https://lucene.apache.org/core/">Lucene</a> analyzer for bigrams,
trigrams, etc., and integrate Apache <a
href="https://tika.apache.org/">Tika</a> to extract text from different
document types (PDF, PPT, XLS, etc.). Here, however we will kwwp ot simple and
split ouor text using regexs and native String methods.</p>
+<h2
id="define-a-function-to-tokenize-and-vectorize-new-text-using-our-current-dictionary">Define
a function to tokenize and vectorize new text using our current dictionary</h2>
+<p>For this simple example, our function <code>vectorizeDocument(...)</code>
will tokenize a new document into unigrams using native Java String methods and
vectorize it using our dictionary and document frequencies. You could also use a
<a href="https://lucene.apache.org/core/">Lucene</a> analyzer for bigrams,
trigrams, etc., and integrate Apache <a
href="https://tika.apache.org/">Tika</a> to extract text from different
document types (PDF, PPT, XLS, etc.). Here, however, we will keep it simple and
split our text using regular expressions and native String methods.</p>
<div class="codehilite"><pre>def vectorizeDocument<span
class="p">(</span>document: String<span class="p">,</span>
dictionaryMap: Map<span class="p">[</span>String<span
class="p">,</span>Int<span class="p">],</span>
dfMap: Map<span class="p">[</span>Int<span
class="p">,</span>Long<span class="p">])</span>: Vector <span
class="o">=</span> <span class="p">{</span>
@@ -371,7 +374,7 @@
</pre></div>
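+<p>As a rough illustration of the tokenize-and-weight idea described above,
here is a minimal sketch that uses only plain Scala collections. It is not the
body of <code>vectorizeDocument(...)</code> from the script: the helper name
<code>tfidfWeights</code>, the <code>numDocs</code> parameter and the smoothed
IDF formula are assumptions made for this example, and the result is a plain
<code>Map</code> rather than a Mahout <code>Vector</code>.</p>
+<div class="codehilite"><pre>// Hypothetical helper, not part of spark-document-classifier.mscala.
+def tfidfWeights(document: String,
+                 dictionaryMap: Map[String, Int],
+                 dfMap: Map[Int, Long],
+                 numDocs: Long): Map[Int, Double] = {
+  // Lowercase the text, split on runs of non-letters, drop empty tokens.
+  val tokens = document.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq
+  // Keep only terms found in the dictionary and count hits per term index.
+  val termFreqs = tokens.flatMap(t => dictionaryMap.get(t))
+    .groupBy(identity)
+    .map { case (idx, hits) => idx -> hits.size }
+  // Weight each count by a smoothed inverse document frequency (assumed formula).
+  termFreqs.map { case (idx, tf) =>
+    val df = dfMap.getOrElse(idx, 0L)
+    idx -> tf * math.log((numDocs + 1.0) / (df + 1.0))
+  }
+}
+</pre></div>
+<p>Terms missing from the dictionary are silently dropped, so a document made
up entirely of unseen words yields an empty map in this sketch.</p>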
-<h2 id="setup-our-classifier">setup our classifier</h2>
+<h2 id="setup-our-classifier">Set up our classifier</h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">labelMap</span> <span class="p">=</span> <span
class="n">model</span><span class="p">.</span><span class="n">labelIndex</span>
<span class="n">val</span> <span class="n">numLabels</span> <span
class="p">=</span> <span class="n">model</span><span class="p">.</span><span
class="n">numLabels</span>
<span class="n">val</span> <span class="n">reverseLabelMap</span> <span
class="p">=</span> <span class="n">labelMap</span><span class="p">.</span><span
class="n">map</span><span class="p">(</span><span class="n">x</span> <span
class="p">=</span><span class="o">></span> <span class="n">x</span><span
class="p">.</span><span class="n">_2</span> <span class="o">-></span> <span
class="n">x</span><span class="p">.</span><span class="n">_1</span><span
class="p">)</span>
@@ -384,8 +387,8 @@
</pre></div>
-<h2 id="define-an-argmax-function">define an argmax function</h2>
-<p>The label with the higest score wins the classification for a given
document</p>
+<h2 id="define-an-argmax-function">Define an argmax function</h2>
+<p>The label with the highest score wins the classification for a given
document.</p>
<div class="codehilite"><pre>def argmax<span class="p">(</span>v: Vector<span
class="p">)</span>: <span class="p">(</span>Int<span class="p">,</span>
Double<span class="p">)</span> <span class="o">=</span> <span class="p">{</span>
var bestIdx: Int <span class="o">=</span> Integer.MIN_VALUE
var bestScore: Double <span class="o">=</span>
Integer.MIN_VALUE.asInstanceOf<span class="p">[</span>Int<span
class="p">]</span><span class="m">.</span>toDouble
@@ -400,7 +403,7 @@
</pre></div>
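+<p>The same selection rule can be sketched on a plain Scala array, without
Mahout's <code>Vector</code> type. The name <code>argmaxOf</code> is
hypothetical and this is not the script's own implementation; it only shows
that the winning label is the index of the highest score.</p>
+<div class="codehilite"><pre>// Hypothetical stand-in for argmax, operating on Array[Double].
+def argmaxOf(scores: Array[Double]): (Int, Double) = {
+  require(scores.nonEmpty, "need at least one score")
+  // Pair each score with its index, then keep the pair with the highest score.
+  val (bestScore, bestIdx) = scores.zipWithIndex.maxBy(_._1)
+  (bestIdx, bestScore)
+}
+</pre></div>
+<p>For example, <code>argmaxOf(Array(-12.5, -3.0, -7.25))</code> returns
<code>(1, -3.0)</code>.</p>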
-<h2 id="define-our-final-tf-idf-vector-classifier">define our final TF(-IDF)
vector classifier</h2>
+<h2 id="define-our-final-tf-idf-vector-classifier">Define our final TF(-IDF)
vector classifier</h2>
<div class="codehilite"><pre><span class="n">def</span> <span
class="n">classifyDocument</span><span class="p">(</span><span
class="n">clvec</span><span class="p">:</span> <span
class="n">Vector</span><span class="p">)</span> <span class="p">:</span> <span
class="n">String</span> <span class="p">=</span> <span class="p">{</span>
<span class="n">val</span> <span class="n">cvec</span> <span
class="p">=</span> <span class="n">classifier</span><span
class="p">.</span><span class="n">classifyFull</span><span
class="p">(</span><span class="n">clvec</span><span class="p">)</span>
<span class="n">val</span> <span class="p">(</span><span
class="n">bestIdx</span><span class="p">,</span> <span
class="n">bestScore</span><span class="p">)</span> <span class="p">=</span>
<span class="n">argmax</span><span class="p">(</span><span
class="n">cvec</span><span class="p">)</span>
@@ -478,7 +481,7 @@
</pre></div>
-<h2 id="vectorize-and-classify-our-documents">vectorize and classify our
documents</h2>
+<h2 id="vectorize-and-classify-our-documents">Vectorize and classify our
documents</h2>
<div class="codehilite"><pre><span class="n">val</span> <span
class="n">usVec</span> <span class="p">=</span> <span
class="n">vectorizeDocument</span><span class="p">(</span><span
class="n">UStextToClassify</span><span class="p">,</span> <span
class="n">dictionaryMap</span><span class="p">,</span> <span
class="n">dfCountMap</span><span class="p">)</span>
<span class="n">val</span> <span class="n">ukVec</span> <span
class="p">=</span> <span class="n">vectorizeDocument</span><span
class="p">(</span><span class="n">UKtextToClassify</span><span
class="p">,</span> <span class="n">dictionaryMap</span><span class="p">,</span>
<span class="n">dfCountMap</span><span class="p">)</span>
@@ -490,7 +493,7 @@
</pre></div>
-<h2 id="tie-everything-together-in-a-new-method-to-classify-new-text">tie
everything together in a new method to classify new text</h2>
+<h2 id="tie-everything-together-in-a-new-method-to-classify-new-text">Tie
everything together in a new method to classify new text</h2>
<div class="codehilite"><pre><span class="n">def</span> <span
class="n">classifyText</span><span class="p">(</span><span
class="n">txt</span><span class="p">:</span> <span class="n">String</span><span
class="p">):</span> <span class="n">String</span> <span class="p">=</span>
<span class="p">{</span>
<span class="n">val</span> <span class="n">v</span> <span
class="p">=</span> <span class="n">vectorizeDocument</span><span
class="p">(</span><span class="n">txt</span><span class="p">,</span> <span
class="n">dictionaryMap</span><span class="p">,</span> <span
class="n">dfCountMap</span><span class="p">)</span>
<span class="n">classifyDocument</span><span class="p">(</span><span
class="n">v</span><span class="p">)</span>
@@ -499,7 +502,7 @@
</pre></div>
-<h2 id="now-we-can-simply-call-our-classifytext-method-on-any-string">now we
can simply call our classifyText method on any string</h2>
+<h2 id="now-we-can-simply-call-our-classifytext-method-on-any-string">Now we
can simply call our classifyText(...) method on any string</h2>
<div class="codehilite"><pre><span class="n">classifyText</span><span
class="p">(</span>"<span class="n">Hello</span> <span
class="n">world</span> <span class="n">from</span> <span
class="n">Queens</span>"<span class="p">)</span>
<span class="n">classifyText</span><span class="p">(</span>"<span
class="n">Hello</span> <span class="n">world</span> <span class="n">from</span>
<span class="n">London</span>"<span class="p">)</span>
</pre></div>