Author: buildbot
Date: Tue Apr 21 00:22:47 2015
New Revision: 948531
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Apr 21 00:22:47 2015
@@ -1 +1 @@
-1675008
+1675009
Modified:
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
==============================================================================
---
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
(original)
+++
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
Tue Apr 21 00:22:47 2015
@@ -273,106 +273,132 @@
</ul>
<h2 id="application">Application</h2>
<p>Using Mahout as a library in an application requires a little Scala
code. Scala provides an <code>App</code> trait, so we'll create an object that
inherits from <code>App</code>:</p>
-<p><code>object CooccurrenceDriver extends App {
-}</code>
-This will look a little different than Java since <code>App</code> does
delayed initialization, which causes the main body to be executed when the App
is launched, just as in Java you would create a CooccurrenceDriver.main.</p>
+<div class="codehilite"><pre><span class="n">object</span> <span
class="n">CooccurrenceDriver</span> <span class="n">extends</span> <span
class="n">App</span> <span class="p">{</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>This looks a little different from Java because <code>App</code> uses
delayed initialization: the body of the object is executed when the App
is launched, much as the body of a CooccurrenceDriver.main method would be in Java.</p>
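<p>As a minimal sketch of the two entry-point styles compared above (the object names <code>Greeting</code>, <code>HelloViaApp</code> and <code>HelloViaMain</code> are made up for illustration and are not part of the example project):</p>

```scala
// Hypothetical names, for illustration only.
object Greeting {
  def message(source: String, args: Array[String]): String =
    s"started via $source with ${args.length} argument(s)"
}

// Style used in this guide: the object body runs when the application is launched.
object HelloViaApp extends App {
  println(Greeting.message("App", args))
}

// Java-like style: an explicit main method holds the same code.
object HelloViaMain {
  def main(args: Array[String]): Unit =
    println(Greeting.message("main", args))
}
```

<p>Both objects can be launched as programs; the <code>App</code> form simply saves writing the <code>main</code> signature.</p>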
<p>Before we can execute something on Spark we'll need to create a context. We
could use raw Spark calls here but default values are set up for a Mahout context.</p>
-<p><code>implicit val mc = mahoutSparkContext(masterUrl = "local", appName =
"2-input-cooc")</code>
-We need to read in three files containing different interaction types. The
files will each be read into a Mahout IndexedDataset. This allows us to
preserve application-specific user and item IDs throughout the calculations.</p>
+<div class="codehilite"><pre><span class="n">implicit</span> <span
class="n">val</span> <span class="n">mc</span> <span class="p">=</span> <span
class="n">mahoutSparkContext</span><span class="p">(</span><span
class="n">masterUrl</span> <span class="p">=</span> "<span
class="n">local</span>"<span class="p">,</span> <span
class="n">appName</span> <span class="p">=</span> "2<span
class="o">-</span><span class="n">input</span><span class="o">-</span><span
class="n">cooc</span>"<span class="p">)</span>
+</pre></div>
+
+
+<p>We need to read in three files containing different interaction types. The
files will each be read into a Mahout IndexedDataset. This allows us to
preserve application-specific user and item IDs throughout the calculations.</p>
<p>For example, here is data/purchase.csv:</p>
-<p>```
-u1,iphone
-u1,ipad
-u2,nexus
-u2,galaxy
-u3,surface
-u4,iphone
-u4,galaxy</p>
-<p>```
-Mahout has a helper function that reads the text delimited in
SparkEngine.indexedDatasetDFSReadElements. The function reads single elements
in a distributed way to create the IndexedDataset. </p>
+<div class="codehilite"><pre><span class="n">u1</span><span
class="p">,</span><span class="n">iphone</span>
+<span class="n">u1</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u2</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u2</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u3</span><span class="p">,</span><span class="n">surface</span>
+<span class="n">u4</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u4</span><span class="p">,</span><span class="n">galaxy</span>
+</pre></div>
+
+
+<p>Mahout has a helper function,
<code>SparkEngine.indexedDatasetDFSReadElements</code>, that reads delimited text. The function reads single elements
in a distributed way to create the IndexedDataset.</p>
<p>Notice we read in all datasets before we adjust the number of rows in them
to match the total number of users in the data. This is so the math works out
even if some users took one action but not another.</p>
-<p>```
-/*<em>
- * Read files of element tuples and create IndexedDatasets one per action.
These share a userID BiMap but have
- * their own itemID BiMaps
- </em>/
-def readActions(actionInput: Array[(String, String)]): Array[(String,
IndexedDataset)] = {
- var actions = Array<a href="">(String, IndexedDataset)</a></p>
-<p>val userDictionary: BiMap[String, Int] = HashBiMap.create()</p>
-<p>// The first action named in the sequence is the "primary" action and
- // begins to fill up the user dictionary
- for ( actionDescription <- actionInput ) {// grab the path to actions
- val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
- actionDescription._2,
- schema = DefaultIndexedDatasetElementReadSchema,
- existingRowIDs = userDictionary)
- userDictionary.putAll(action.rowIDs)
- // put the name in the tuple with the indexedDataset
- actions = actions :+ (actionDescription._1, action)
- }</p>
-<p>// After all actions are read in the userDictonary will contain every user
seen,
- // even if they may not have taken all actions . Now we adjust the row rank
of
- // all IndxedDataset's to have this number of rows
- // Note: this is very important or the cooccurrence calc may fail
- val numUsers = userDictionary.size() // one more than the cardinality</p>
-<p>val resizedNameActionPairs = actions.map { a =>
- //resize the matrix by, in effect by adding empty rows
- val resizedMatrix = a._2.create(a._2.matrix, userDictionary,
a._2.columnIDs).newRowCardinality(numUsers)
- (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
- }
- resizedNameActionPairs // return the array of Tuples
-}</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="o">/**</span>
+ <span class="o">*</span> Read files of element tuples and create
IndexedDatasets one per action. These share a userID BiMap but have
+ <span class="o">*</span> their own itemID BiMaps
+ <span class="o">*/</span>
+def readActions<span class="p">(</span>actionInput: Array<span
class="p">[(</span>String<span class="p">,</span> String<span
class="p">)])</span>: Array<span class="p">[(</span>String<span
class="p">,</span> IndexedDataset<span class="p">)]</span> <span
class="o">=</span> <span class="p">{</span>
+ var actions <span class="o">=</span> Array<span
class="p">[(</span>String<span class="p">,</span> IndexedDataset<span
class="p">)]()</span>
+
+ val userDictionary: BiMap<span class="p">[</span>String<span
class="p">,</span> Int<span class="p">]</span> <span class="o">=</span>
HashBiMap.create<span class="p">()</span>
+
+ <span class="o">//</span> The first action named in the sequence is the
<span class="s">"primary"</span> action and
+ <span class="o">//</span> begins to fill up the user dictionary
+ <span class="kr">for</span> <span class="p">(</span> actionDescription <span
class="o"><-</span> actionInput <span class="p">)</span> <span
class="p">{</span><span class="o">//</span> grab the path to actions
+ val action: IndexedDataset <span class="o">=</span>
SparkEngine.indexedDatasetDFSReadElements<span class="p">(</span>
+ actionDescription._2<span class="p">,</span>
+ schema <span class="o">=</span>
DefaultIndexedDatasetElementReadSchema<span class="p">,</span>
+ existingRowIDs <span class="o">=</span> userDictionary<span
class="p">)</span>
+ userDictionary.putAll<span class="p">(</span>action.rowIDs<span
class="p">)</span>
+ <span class="o">//</span> put the name in the tuple with the indexedDataset
+ actions <span class="o">=</span> actions :<span class="o">+</span> <span
class="p">(</span>actionDescription._1<span class="p">,</span> action<span
class="p">)</span>
+ <span class="p">}</span>
+
+ <span class="o">//</span> After all actions are read in the userDictionary
will contain every user seen<span class="p">,</span>
+ <span class="o">//</span> even if they may not have taken all actions<span
class="m">.</span> Now we adjust the row rank of
+ <span class="o">//</span> all IndexedDatasets to have this number of rows
+ <span class="o">//</span> Note: this is very important or the cooccurrence
calc may fail
+ val numUsers <span class="o">=</span> userDictionary.size<span
class="p">()</span> <span class="o">//</span> one more than the cardinality
+
+ val resizedNameActionPairs <span class="o">=</span> actions.map <span
class="p">{</span> a <span class="o">=></span>
+ <span class="o">//</span> resize the matrix<span class="p">,</span> in
effect by adding empty rows
+ val resizedMatrix <span class="o">=</span> a._2.create<span
class="p">(</span>a._2.matrix<span class="p">,</span> userDictionary<span
class="p">,</span> a._2.columnIDs<span class="p">)</span><span
class="m">.</span>newRowCardinality<span class="p">(</span>numUsers<span
class="p">)</span>
+ <span class="p">(</span>a._1<span class="p">,</span> resizedMatrix<span
class="p">)</span> <span class="o">//</span> return the Tuple of <span
class="p">(</span>name<span class="p">,</span> IndexedDataset<span
class="p">)</span>
+ <span class="p">}</span>
+ resizedNameActionPairs <span class="o">//</span> return the array of Tuples
+<span class="p">}</span>
+</pre></div>
+
+
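<p>The reason for reading every action before resizing can be illustrated without Mahout or Spark. The following is only a sketch using plain Scala collections; <code>RowPaddingSketch</code>, <code>buildUserDictionary</code> and <code>toRows</code> are hypothetical names and do not correspond to anything in the Mahout API.</p>

```scala
// Illustration only: plain collections stand in for IndexedDatasets.
object RowPaddingSketch {
  // An action is a list of (userID, itemID) pairs, as in data/purchase.csv.
  type Action = Seq[(String, String)]

  /** Assign each distinct user a row index, in first-seen order across all actions. */
  def buildUserDictionary(actions: Seq[Action]): Map[String, Int] =
    actions.flatten.map(_._1).distinct.zipWithIndex.toMap

  /** Lay one action out as rows of item sets, one row per user in the shared dictionary. */
  def toRows(action: Action, users: Map[String, Int]): Vector[Set[String]] = {
    // Start with an empty row for every known user, then fill in what this action saw.
    val empty = Vector.fill(users.size)(Set.empty[String])
    action.foldLeft(empty) { case (rows, (user, item)) =>
      val i = users(user)
      rows.updated(i, rows(i) + item)
    }
  }
}
```

<p>If purchases come from u1 and u2 while views come from u1 and u3, the shared dictionary holds three users, so both actions are laid out with three rows and the purchase row for u3 is simply empty. Building the rows from a dictionary that only knew about the first action would leave the two layouts with different row counts.</p>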
<p>Now that we have the data read in we can perform the cooccurrence
calculation.</p>
-<p>```
-// strip off names, which only takes and array of IndexedDatasets
-val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a
=> a._2))</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="c1">// strip off names, since cooccurrencesIDSs only
takes an array of IndexedDatasets</span>
+<span class="n">val</span> <span class="n">indicatorMatrices</span> <span
class="o">=</span> <span class="n">SimilarityAnalysis</span><span
class="p">.</span><span class="n">cooccurrencesIDSs</span><span
class="p">(</span><span class="n">actions</span><span class="p">.</span><span
class="n">map</span><span class="p">(</span><span class="n">a</span> <span
class="o">=></span> <span class="n">a</span><span class="p">.</span><span
class="n">_2</span><span class="p">))</span>
+</pre></div>
+
+
<p>All we need to do now is write the indicators.</p>
-<p><code>// zip a pair of arrays into an array of pairs, reattaching the
action names
-val indicatorDescriptions = actions.map(a => a._1).zip(indicatorMatrices)
-writeIndicators(indicatorDescriptions)</code></p>
+<div class="codehilite"><pre><span class="c1">// zip a pair of arrays into an
array of pairs, reattaching the action names</span>
+<span class="n">val</span> <span class="n">indicatorDescriptions</span> <span
class="o">=</span> <span class="n">actions</span><span class="p">.</span><span
class="n">map</span><span class="p">(</span><span class="n">a</span> <span
class="o">=></span> <span class="n">a</span><span class="p">.</span><span
class="n">_1</span><span class="p">).</span><span class="n">zip</span><span
class="p">(</span><span class="n">indicatorMatrices</span><span
class="p">)</span>
+<span class="n">writeIndicators</span><span class="p">(</span><span
class="n">indicatorDescriptions</span><span class="p">)</span>
+</pre></div>
+
+
<p>The <code>writeIndicators</code> method uses the default write function
<code>dfsWrite</code>.</p>
-<p>```
-/*<em>
- * Write indicatorMatrices to the output dir in the default format
- </em>/
-def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
- for (indicator <- indicators ) {
- val indicatorDir = OutputPath + indicator._1
- indicator._2.dfsWrite(
- indicatorDir, // do we have to remove the last $ char?
- // omit LLR strengths and format for search engine indexing
- IndexedDatasetWriteBooleanSchema)
- }
-}</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="o">/**</span>
+ <span class="o">*</span> Write indicatorMatrices to the output dir in the
default format
+ <span class="o">*/</span>
+def writeIndicators<span class="p">(</span> indicators: Array<span
class="p">[(</span>String<span class="p">,</span> IndexedDataset<span
class="p">)])</span> <span class="o">=</span> <span class="p">{</span>
+ <span class="kr">for</span> <span class="p">(</span>indicator <span
class="o"><-</span> indicators <span class="p">)</span> <span
class="p">{</span>
+ val indicatorDir <span class="o">=</span> OutputPath <span
class="o">+</span> indicator._1
+ indicator._2.dfsWrite<span class="p">(</span>
+ indicatorDir<span class="p">,</span> <span class="o">//</span> do we
have to remove the last <span class="p">$</span> char?
+ <span class="o">//</span> omit LLR strengths and format for search
engine indexing
+ IndexedDatasetWriteBooleanSchema<span class="p">)</span>
+ <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
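<p>The zip step used to reattach action names before writing depends on both arrays being in the same order. A small sketch of that pairing, with integers standing in for the indicator matrices (<code>ZipSketch</code> and <code>reattachNames</code> are hypothetical names):</p>

```scala
// Illustration only: Int values stand in for the computed indicator matrices.
object ZipSketch {
  // Pair each action name with the result computed at the same position.
  def reattachNames(names: Array[String], results: Array[Int]): Array[(String, Int)] =
    names.zip(results)
}
```

<p>Because <code>map</code> preserves element order, results derived from <code>actions.map(a => a._2)</code> line up position by position with <code>actions.map(a => a._1)</code>, which is what makes the zip safe here.</p>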
<p>See the GitHub project for the full source. Now we create a build.sbt to
build the example.</p>
-<p>```
-name := "cooccurrence-driver"</p>
-<p>organization := "com.finderbots"</p>
-<p>version := "0.1"</p>
-<p>scalaVersion := "2.10.4"</p>
-<p>val sparkVersion = "1.1.1"</p>
-<p>libraryDependencies ++= Seq(
- "log4j" % "log4j" % "1.2.17",
- // Mahout's Spark code
- "commons-io" % "commons-io" % "2.4",
- "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
- "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
- "org.apache.mahout" % "mahout-math" % "0.10.0",
- "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
- // Google collections, AKA Guava
- "com.google.guava" % "guava" % "16.0")</p>
-<p>resolvers += "typesafe repo" at "
http://repo.typesafe.com/typesafe/releases/"</p>
-<p>resolvers += Resolver.mavenLocal</p>
-<p>packSettings</p>
-<p>packMain := Map(
- "cooc" -> "CooccurrenceDriver"
-)</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="n">name</span> <span
class="p">:=</span> "<span class="n">cooccurrence</span><span
class="o">-</span><span class="n">driver</span>"
+
+<span class="n">organization</span> <span class="p">:=</span> "<span
class="n">com</span><span class="p">.</span><span
class="n">finderbots</span>"
+
+<span class="n">version</span> <span class="p">:=</span> "0<span
class="p">.</span>1"
+
+<span class="n">scalaVersion</span> <span class="p">:=</span> "2<span
class="p">.</span>10<span class="p">.</span>4"
+
+<span class="n">val</span> <span class="n">sparkVersion</span> <span
class="p">=</span> "1<span class="p">.</span>1<span
class="p">.</span>1"
+
+<span class="n">libraryDependencies</span> <span class="o">++</span><span
class="p">=</span> <span class="n">Seq</span><span class="p">(</span>
+ "<span class="n">log4j</span>" <span class="c">% "log4j"
% "1.2.17",</span>
+ <span class="o">//</span> <span class="n">Mahout</span><span
class="o">'</span><span class="n">s</span> <span class="n">Spark</span>
<span class="n">code</span>
+ "<span class="n">commons</span><span class="o">-</span><span
class="n">io</span>" <span class="c">% "commons-io" %
"2.4",</span>
+ "<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span
class="n">mahout</span>" <span class="c">%
"mahout-math-scala_2.10" % "0.10.0",</span>
+ "<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span
class="n">mahout</span>" <span class="c">% "mahout-spark_2.10" %
"0.10.0",</span>
+ "<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span
class="n">mahout</span>" <span class="c">% "mahout-math" %
"0.10.0",</span>
+ "<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span
class="n">mahout</span>" <span class="c">% "mahout-hdfs" %
"0.10.0",</span>
+ <span class="o">//</span> <span class="n">Google</span> <span
class="n">collections</span><span class="p">,</span> <span class="n">AKA</span>
<span class="n">Guava</span>
+ "<span class="n">com</span><span class="p">.</span><span
class="n">google</span><span class="p">.</span><span
class="n">guava</span>" <span class="c">% "guava" %
"16.0")</span>
+
+<span class="n">resolvers</span> <span class="o">+</span><span
class="p">=</span> "<span class="n">typesafe</span> <span
class="n">repo</span>" <span class="n">at</span> "<span
class="n">http</span><span class="p">:</span><span class="o">//</span><span
class="n">repo</span><span class="p">.</span><span
class="n">typesafe</span><span class="p">.</span><span
class="n">com</span><span class="o">/</span><span
class="n">typesafe</span><span class="o">/</span><span
class="n">releases</span><span class="o">/</span>"
+
+<span class="n">resolvers</span> <span class="o">+</span><span
class="p">=</span> <span class="n">Resolver</span><span class="p">.</span><span
class="n">mavenLocal</span>
+
+<span class="n">packSettings</span>
+
+<span class="n">packMain</span> <span class="p">:=</span> <span
class="n">Map</span><span class="p">(</span>
+ "<span class="n">cooc</span>" <span class="o">-></span>
"<span class="n">CooccurrenceDriver</span>"<span class="p">)</span>
+</pre></div>
+
+
<h2 id="build">Build</h2>
<p>Build the examples from the project's root folder:</p>
<div class="codehilite"><pre>$ <span class="n">sbt</span> <span
class="n">pack</span>