apalumbo Thu, 23 Apr 2015 12:54:17 -0700

Author: apalumbo
Date: Thu Apr 23 19:52:58 2015
New Revision: 1675714

URL: http://svn.apache.org/r1675714
Log:
change the description and fix some inline code, change the title of the menu 
link, add in model persistence section.


Modified:
    
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
    mahout/site/mahout_cms/trunk/templates/standard.html

Modified: 
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext?rev=1675714&r1=1675713&r2=1675714&view=diff
==============================================================================
--- 
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
 (original)
+++ 
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
 Thu Apr 23 19:52:58 2015
@@ -1,6 +1,6 @@
-#Classifying a Document with the Mahout Shell
+#Building a text classifier in Mahout's Spark Shell
 
-This tutorial will take you through the steps used to train and create a 
Multinomial Naive Bayes text classifier using the ```mahout spark-shell```. 
+This tutorial will take you through the steps used to train a Multinomial 
Naive Bayes model and create a text classifier based on that model using the 
```mahout spark-shell```. 
 
 ## Prerequisites
 This tutorial assumes that you have your Spark environment variables set for 
the ```mahout spark-shell``` see: [Playing with Mahout's 
Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html).  As 
well we assume that Mahout is running in cluster mode (i.e. with the 
```MAHOUT_LOCAL``` environment variable **unset**) as we'll be reading and 
writing to HDFS.
@@ -43,7 +43,7 @@ Hadoop Imports needed to read our dictio
     val pathToData = "/tmp/mahout-work-wiki/"
     val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
 
-## Extract the category of each observation and aggregate those observation by 
category
+## Extract the category of each observation and aggregate those observations 
by category
 
     val (labelIndex, aggregatedObservations) = 
SparkNaiveBayes.extractLabelsAndAggregateObservations(
                                                                  fullData)
@@ -81,7 +81,7 @@ printing the ```ResultAnalyzer``` will d
 
 ## Define a function to tokenize and vectorize new text using our current 
dictionary
 
-For this simple example, our function ```vectorizeDocument(...) will tokenize 
a new document into unigrams using native Java String methods and vectorize 
using our dictionary and document frequencies. You could also use a 
[Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc., 
and integrate Apache [Tika](https://tika.apache.org/) to extract text from 
different document types (PDF, PPT, XLS, etc.).  Here, however we will keep it 
simple and split our text using regexs and native String methods.
+For this simple example, our function ```vectorizeDocument(...)``` will 
tokenize a new document into unigrams using native Java String methods and 
vectorize using our dictionary and document frequencies. You could also use a 
[Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc., 
and integrate Apache [Tika](https://tika.apache.org/) to extract text from 
different document types (PDF, PPT, XLS, etc.).  Here, however we will keep it 
simple, stripping and tokenizing our text using regexs and native String 
methods.
 
     def vectorizeDocument(document: String,
                             dictionaryMap: Map[String,Int],
@@ -139,7 +139,7 @@ The label with the highest score wins th
         (bestIdx, bestScore)
     }
 
-## Define our final TF(-IDF) vector classifier
+## Define our TF(-IDF) vector classifier
 
     def classifyDocument(clvec: Vector) : String = {
         val cvec = classifier.classifyFull(clvec)
@@ -226,7 +226,7 @@ The label with the highest score wins th
     println("Classifying the news article about Manchester United (united 
kingdom)")
     classifyDocument(ukVec)
 
-## Tie everything together in a new method to classify new text 
+## Tie everything together in a new method to classify text 
     
     def classifyText(txt: String): String = {
         val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
@@ -237,4 +237,16 @@ The label with the highest score wins th
 ## Now we can simply call our classifyText(...) method on any String
 
     classifyText("Hello world from Queens")
-    classifyText("Hello world from London")
\ No newline at end of file
+    classifyText("Hello world from London")
+    
+## Model persistence
+
+You can save the model to HDFS:
+
+    model.dfsWrite("/path/to/model")
+    
+And retrieve it with:
+
+    val model =  NBModel.dfsRead("/path/to/model")
+
+The trained model can now be embedded in an external application.
\ No newline at end of file

Modified: mahout/site/mahout_cms/trunk/templates/standard.html
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/templates/standard.html?rev=1675714&r1=1675713&r2=1675714&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/templates/standard.html (original)
+++ mahout/site/mahout_cms/trunk/templates/standard.html Thu Apr 23 19:52:58 
2015
@@ -151,7 +151,7 @@
                   <li class="nav-header">Tutorials</li>
                   <li><a 
href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark 
Shell</a></li>
                   <li><a 
href="/users/environment/how-to-build-an-app.html">How to build an app</a></li>
-                  <li><a 
href="/users/environment/classify-a-doc-from-the-shell.html">Classify a 
document from the Shell</a></li>
+                  <li><a 
href="/users/environment/classify-a-doc-from-the-shell.html">Building a text 
classifier in Mahout's Spark Shell</a></li>
                 </ul>
               </li>
               <li class="dropdown"> <a href="#" class="dropdown-toggle" 
data-toggle="dropdown">Algorithms<b class="caret"></b></a>
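
Note on the `vectorizeDocument(...)` paragraph touched in the diff above: the body of that function is not shown in this commit, so the following is only a minimal sketch of the idea it describes (unigram tokenization with native String methods, then TF-IDF weighting against a dictionary and document-frequency counts). It uses plain Scala collections rather than Mahout vectors. The names `VectorizeSketch`, `tokenize`, `vectorize`, and `numDocs`, as well as the `tf * log(N / df)` weighting, are assumptions made for illustration and may differ from what the tutorial actually does.

```scala
// Sketch only: plain Scala, no Mahout types. All names here are hypothetical.
object VectorizeSketch {

  /** Lowercase the text, turn every non-alphanumeric run into a space, split into unigrams. */
  def tokenize(document: String): Seq[String] =
    document.toLowerCase
      .replaceAll("[^a-z0-9]+", " ")
      .trim
      .split("\\s+")
      .filter(_.nonEmpty)
      .toSeq

  /** Sparse TF-IDF weights keyed by dictionary index; terms missing from the dictionary are dropped. */
  def vectorize(document: String,
                dictionary: Map[String, Int],
                dfCounts: Map[Int, Int],
                numDocs: Int): Map[Int, Double] = {
    // Term frequency per dictionary index.
    val termCounts: Map[Int, Int] = tokenize(document)
      .filter(dictionary.contains)
      .groupBy(identity)
      .map { case (term, hits) => dictionary(term) -> hits.size }

    // Assumed weighting: tf * ln(numDocs / df), with df floored at 1 to avoid division by zero.
    termCounts.map { case (idx, tf) =>
      val df = math.max(dfCounts.getOrElse(idx, 1), 1)
      idx -> tf * math.log(numDocs.toDouble / df)
    }
  }
}
```

A call might look like `VectorizeSketch.vectorize("Hello world from London", dictionary, dfCounts, numDocs)`, where `dictionary` maps terms to indices and `dfCounts` maps indices to document frequencies. In the tutorial itself the result would be a Mahout `Vector` passed on to `classifyDocument(...)`.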

svn commit: r1675714 - in /mahout/site/mahout_cms/trunk: content/users/environment/classify-a-doc-from-the-shell.mdtext templates/standard.html