Author: apalumbo
Date: Thu Apr 23 19:52:58 2015
New Revision: 1675714
URL: http://svn.apache.org/r1675714
Log:
change the description and fix some inline code, change the title of the menu
link, add in model persistence section.
Modified:
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
mahout/site/mahout_cms/trunk/templates/standard.html
Modified:
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
URL:
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext?rev=1675714&r1=1675713&r2=1675714&view=diff
==============================================================================
---
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
(original)
+++
mahout/site/mahout_cms/trunk/content/users/environment/classify-a-doc-from-the-shell.mdtext
Thu Apr 23 19:52:58 2015
@@ -1,6 +1,6 @@
-#Classifying a Document with the Mahout Shell
+#Building a text classifier in Mahout's Spark Shell
-This tutorial will take you through the steps used to train and create a
Multinomial Naive Bayes text classifier using the ```mahout spark-shell```.
+This tutorial will take you through the steps used to train a Multinomial
Naive Bayes model and create a text classifier based on that model using the
```mahout spark-shell```.
## Prerequisites
This tutorial assumes that you have your Spark environment variables set for
the ```mahout spark-shell``` (see [Playing with Mahout's
Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html)). We
also assume that Mahout is running in cluster mode (i.e. with the
```MAHOUT_LOCAL``` environment variable **unset**), as we'll be reading from and
writing to HDFS.
@@ -43,7 +43,7 @@ Hadoop Imports needed to read our dictio
val pathToData = "/tmp/mahout-work-wiki/"
val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
-## Extract the category of each observation and aggregate those observation by
category
+## Extract the category of each observation and aggregate those observations
by category
val (labelIndex, aggregatedObservations) =
SparkNaiveBayes.extractLabelsAndAggregateObservations(
fullData)
@@ -81,7 +81,7 @@ printing the ```ResultAnalyzer``` will d
## Define a function to tokenize and vectorize new text using our current
dictionary
-For this simple example, our function ```vectorizeDocument(...) will tokenize
a new document into unigrams using native Java String methods and vectorize
using our dictionary and document frequencies. You could also use a
[Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc.,
and integrate Apache [Tika](https://tika.apache.org/) to extract text from
different document types (PDF, PPT, XLS, etc.). Here, however we will keep it
simple and split our text using regexs and native String methods.
+For this simple example, our function ```vectorizeDocument(...)``` will
tokenize a new document into unigrams using native Java String methods and
vectorize using our dictionary and document frequencies. You could also use a
[Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc.,
and integrate Apache [Tika](https://tika.apache.org/) to extract text from
different document types (PDF, PPT, XLS, etc.). Here, however, we will keep it
simple, stripping and tokenizing our text using regexes and native String
methods.
def vectorizeDocument(document: String,
dictionaryMap: Map[String,Int],
@@ -139,7 +139,7 @@ The label with the highest score wins th
(bestIdx, bestScore)
}
-## Define our final TF(-IDF) vector classifier
+## Define our TF(-IDF) vector classifier
def classifyDocument(clvec: Vector) : String = {
val cvec = classifier.classifyFull(clvec)
@@ -226,7 +226,7 @@ The label with the highest score wins th
println("Classifying the news article about Manchester United (united
kingdom)")
classifyDocument(ukVec)
-## Tie everything together in a new method to classify new text
+## Tie everything together in a new method to classify text
def classifyText(txt: String): String = {
val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
@@ -237,4 +237,16 @@ The label with the highest score wins th
## Now we can simply call our classifyText(...) method on any String
classifyText("Hello world from Queens")
- classifyText("Hello world from London")
\ No newline at end of file
+ classifyText("Hello world from London")
+
+## Model persistence
+
+You can save the model to HDFS:
+
+ model.dfsWrite("/path/to/model")
+
+And retrieve it with:
+
+ val model = NBModel.dfsRead("/path/to/model")
+
+The trained model can now be embedded in an external application.
\ No newline at end of file
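
Editorial sketch (not part of the commit): the tutorial text above describes ```vectorizeDocument(...)``` as tokenizing a document into unigrams with regexes and native String methods, then weighting the terms using the dictionary and document frequencies. The self-contained Scala below illustrates that idea with plain collections instead of Mahout vectors. The names `VectorizeSketch`, `tokenize`, `vectorize` and `numDocs` are hypothetical, the `Map[Int, Long]` type of `dfCountMap` is an assumption, and the weighting `tf * ln(numDocs / df)` is a common textbook form that may differ from the weighting Mahout applies.

```scala
// Hypothetical stand-in for the tutorial's vectorizeDocument(...).
// It uses plain Scala collections, not Mahout's Vector types.
object VectorizeSketch {

  // Lowercase the text, turn every run of non-alphanumeric characters
  // into a single space, then split on whitespace to get unigrams.
  def tokenize(document: String): Seq[String] =
    document.toLowerCase
      .replaceAll("[^a-z0-9]+", " ")
      .trim
      .split("\\s+")
      .filter(_.nonEmpty)
      .toSeq

  // Returns a sparse vector as Map(termIndex -> weight).
  // Assumed weighting: tf * ln(numDocs / df); Mahout's own TF-IDF may differ.
  // Terms missing from the dictionary (or with no document frequency) are dropped.
  def vectorize(document: String,
                dictionaryMap: Map[String, Int],
                dfCountMap: Map[Int, Long],
                numDocs: Long): Map[Int, Double] = {
    // Term frequency of each unigram in this document.
    val termCounts: Map[String, Int] =
      tokenize(document).groupBy(identity).map { case (t, occ) => (t, occ.size) }

    val weights = for {
      (term, tf) <- termCounts.toSeq
      idx        <- dictionaryMap.get(term).toSeq
      df         <- dfCountMap.get(idx).toSeq
      if df > 0
    } yield idx -> tf * math.log(numDocs.toDouble / df)

    weights.toMap
  }

  def main(args: Array[String]): Unit = {
    // Toy dictionary and document frequencies, made up for illustration.
    val dictionary = Map("hello" -> 0, "world" -> 1, "london" -> 2)
    val dfCounts   = Map(0 -> 10L, 1 -> 5L, 2 -> 1L)
    val vec = vectorize("Hello, world! Hello from London.", dictionary, dfCounts, 10L)
    println(vec.toSeq.sortBy(_._1).mkString(", "))
  }
}
```

In this toy run "from" is dropped because it is absent from the dictionary, and "hello" receives weight zero because it occurs in every one of the ten assumed documents.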
Modified: mahout/site/mahout_cms/trunk/templates/standard.html
URL:
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/templates/standard.html?rev=1675714&r1=1675713&r2=1675714&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/templates/standard.html (original)
+++ mahout/site/mahout_cms/trunk/templates/standard.html Thu Apr 23 19:52:58
2015
@@ -151,7 +151,7 @@
<li class="nav-header">Tutorials</li>
<li><a
href="/users/sparkbindings/play-with-shell.html">Playing with Mahout's Spark
Shell</a></li>
<li><a
href="/users/environment/how-to-build-an-app.html">How to build an app</a></li>
- <li><a
href="/users/environment/classify-a-doc-from-the-shell.html">Classify a
document from the Shell</a></li>
+ <li><a
href="/users/environment/classify-a-doc-from-the-shell.html">Building a text
classifier in Mahout's Spark Shell</a></li>
</ul>
</li>
<li class="dropdown"> <a href="#" class="dropdown-toggle"
data-toggle="dropdown">Algorithms<b class="caret"></b></a>