how-to-build-an-app.html

buildbot Mon, 20 Apr 2015 17:23:04 -0700

Author: buildbot
Date: Tue Apr 21 00:22:47 2015
New Revision: 948531

Log:
Staging update by buildbot for mahout


Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Apr 21 00:22:47 2015
@@ -1 +1 @@
-1675008
+1675009

Modified: 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
 Tue Apr 21 00:22:47 2015
@@ -273,106 +273,132 @@
 </ul>
 <h2 id="application">Application</h2>
 <p>Using Mahout as a library in an application requires a little Scala 
code. Scala provides an <code>App</code> trait, so we'll create an object that 
inherits from <code>App</code>:</p>
-<p><code>object CooccurrenceDriver extends App {
-}</code>
-This will look a little different than Java since <code>App</code> does 
delayed initialization, which causes the main body to be executed when the App 
is launched, just as in Java you would create a CooccurrenceDriver.main.</p>
+<div class="codehilite"><pre><span class="n">object</span> <span 
class="n">CooccurrenceDriver</span> <span class="n">extends</span> <span 
class="n">App</span> <span class="p">{</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>This looks a little different from Java because <code>App</code> uses 
delayed initialization: the body of the object is executed when the App is 
launched, much as the body of a <code>CooccurrenceDriver.main</code> method would be in Java.</p>
 <p>Before we can execute something on Spark we'll need to create a context. We 
could use raw Spark calls here, but default values are set up for a Mahout context.</p>
-<p><code>implicit val mc = mahoutSparkContext(masterUrl = "local", appName = 
"2-input-cooc")</code>
-We need to read in three files containing different interaction types. The 
files will each be read into a Mahout IndexedDataset. This allows us to 
preserve application-specific user and item IDs throughout the calculations.</p>
+<div class="codehilite"><pre><span class="n">implicit</span> <span 
class="n">val</span> <span class="n">mc</span> <span class="p">=</span> <span 
class="n">mahoutSparkContext</span><span class="p">(</span><span 
class="n">masterUrl</span> <span class="p">=</span> &quot;<span 
class="n">local</span>&quot;<span class="p">,</span> <span 
class="n">appName</span> <span class="p">=</span> &quot;2<span 
class="o">-</span><span class="n">input</span><span class="o">-</span><span 
class="n">cooc</span>&quot;<span class="p">)</span>
+</pre></div>
+
+
+<p>We need to read in three files containing different interaction types. The 
files will each be read into a Mahout IndexedDataset. This allows us to 
preserve application-specific user and item IDs throughout the calculations.</p>
 <p>For example, here is data/purchase.csv:</p>
-<p>```
-u1,iphone
-u1,ipad
-u2,nexus
-u2,galaxy
-u3,surface
-u4,iphone
-u4,galaxy</p>
-<p>```
-Mahout has a helper function that reads the text delimited in 
SparkEngine.indexedDatasetDFSReadElements. The function reads single elements 
in a distributed way to create the IndexedDataset. </p>
+<div class="codehilite"><pre><span class="n">u1</span><span 
class="p">,</span><span class="n">iphone</span>
+<span class="n">u1</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u2</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u2</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u3</span><span class="p">,</span><span class="n">surface</span>
+<span class="n">u4</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u4</span><span class="p">,</span><span class="n">galaxy</span>
+</pre></div>
+
+
+<p>Mahout has a helper function, 
<code>SparkEngine.indexedDatasetDFSReadElements</code>, that reads delimited text. It reads single elements 
in a distributed way to create the IndexedDataset.</p>
 <p>Notice we read in all datasets before we adjust the number of rows in them 
to match the total number of users in the data. This is so the math works out 
even if some users took one action but not another.</p>
-<p>```
-/*<em>
- * Read files of element tuples and create IndexedDatasets one per action. 
These share a userID BiMap but have
- * their own itemID BiMaps
- </em>/
-def readActions(actionInput: Array[(String, String)]): Array[(String, 
IndexedDataset)] = {
-  var actions = Array<a href="">(String, IndexedDataset)</a></p>
-<p>val userDictionary: BiMap[String, Int] = HashBiMap.create()</p>
-<p>// The first action named in the sequence is the "primary" action and 
-  // begins to fill up the user dictionary
-  for ( actionDescription &lt;- actionInput ) {// grab the path to actions
-    val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
-      actionDescription._2,
-      schema = DefaultIndexedDatasetElementReadSchema,
-      existingRowIDs = userDictionary)
-    userDictionary.putAll(action.rowIDs)
-    // put the name in the tuple with the indexedDataset
-    actions = actions :+ (actionDescription._1, action) 
-  }</p>
-<p>// After all actions are read in the userDictonary will contain every user 
seen, 
-  // even if they may not have taken all actions . Now we adjust the row rank 
of 
-  // all IndxedDataset's to have this number of rows
-  // Note: this is very important or the cooccurrence calc may fail
-  val numUsers = userDictionary.size() // one more than the cardinality</p>
-<p>val resizedNameActionPairs = actions.map { a =&gt;
-    //resize the matrix by, in effect by adding empty rows
-    val resizedMatrix = a._2.create(a._2.matrix, userDictionary, 
a._2.columnIDs).newRowCardinality(numUsers)
-    (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
-  }
-  resizedNameActionPairs // return the array of Tuples
-}</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="o">/**</span>
+ <span class="o">*</span> Read files of element tuples and create 
IndexedDatasets, one per action. These share a userID BiMap but have
+ <span class="o">*</span> their own itemID BiMaps
+ <span class="o">*/</span>
+def readActions<span class="p">(</span>actionInput: Array<span 
class="p">[(</span>String<span class="p">,</span> String<span 
class="p">)])</span>: Array<span class="p">[(</span>String<span 
class="p">,</span> IndexedDataset<span class="p">)]</span> <span 
class="o">=</span> <span class="p">{</span>
+  var actions <span class="o">=</span> Array<span 
class="p">[(</span>String<span class="p">,</span> IndexedDataset<span 
class="p">)]()</span>
+
+  val userDictionary: BiMap<span class="p">[</span>String<span 
class="p">,</span> Int<span class="p">]</span> <span class="o">=</span> 
HashBiMap.create<span class="p">()</span>
+
+  <span class="o">//</span> The first action named in the sequence is the 
<span class="s">&quot;primary&quot;</span> action and 
+  <span class="o">//</span> begins to fill up the user dictionary
+  <span class="kr">for</span> <span class="p">(</span> actionDescription <span 
class="o">&lt;-</span> actionInput <span class="p">)</span> <span 
class="p">{</span><span class="o">//</span> grab the path to actions
+    val action: IndexedDataset <span class="o">=</span> 
SparkEngine.indexedDatasetDFSReadElements<span class="p">(</span>
+      actionDescription._2<span class="p">,</span>
+      schema <span class="o">=</span> 
DefaultIndexedDatasetElementReadSchema<span class="p">,</span>
+      existingRowIDs <span class="o">=</span> userDictionary<span 
class="p">)</span>
+    userDictionary.putAll<span class="p">(</span>action.rowIDs<span 
class="p">)</span>
+    <span class="o">//</span> put the name in the tuple with the indexedDataset
+    actions <span class="o">=</span> actions :<span class="o">+</span> <span 
class="p">(</span>actionDescription._1<span class="p">,</span> action<span 
class="p">)</span> 
+  <span class="p">}</span>
+
+  <span class="o">//</span> After all actions are read in the userDictionary 
will contain every user seen<span class="p">,</span> 
+  <span class="o">//</span> even if they may not have taken all actions<span 
class="m">.</span> Now we adjust the row rank of 
+  <span class="o">//</span> all IndexedDatasets to have this number of rows
+  <span class="o">//</span> Note: this is very important or the cooccurrence 
calc may fail
+  val numUsers <span class="o">=</span> userDictionary.size<span 
class="p">()</span> <span class="o">//</span> one more than the cardinality
+
+  val resizedNameActionPairs <span class="o">=</span> actions.map <span 
class="p">{</span> a <span class="o">=&gt;</span>
+    <span class="o">//</span> resize the matrix<span class="p">,</span> in 
effect by adding empty rows
+    val resizedMatrix <span class="o">=</span> a._2.create<span 
class="p">(</span>a._2.matrix<span class="p">,</span> userDictionary<span 
class="p">,</span> a._2.columnIDs<span class="p">)</span><span 
class="m">.</span>newRowCardinality<span class="p">(</span>numUsers<span 
class="p">)</span>
+    <span class="p">(</span>a._1<span class="p">,</span> resizedMatrix<span 
class="p">)</span> <span class="o">//</span> return the Tuple of <span 
class="p">(</span>name<span class="p">,</span> IndexedDataset<span 
class="p">)</span>
+  <span class="p">}</span>
+  resizedNameActionPairs <span class="o">//</span> return the array of Tuples
+<span class="p">}</span>
+</pre></div>
+
+
 <p>Now that we have the data read in we can perform the cooccurrence 
calculation.</p>
-<p>```
-// strip off names, which only takes and array of IndexedDatasets
-val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a 
=&gt; a._2))</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="c1">// strip off names, since cooccurrencesIDSs only 
takes an array of IndexedDatasets</span>
+<span class="n">val</span> <span class="n">indicatorMatrices</span> <span 
class="o">=</span> <span class="n">SimilarityAnalysis</span><span 
class="p">.</span><span class="n">cooccurrencesIDSs</span><span 
class="p">(</span><span class="n">actions</span><span class="p">.</span><span 
class="n">map</span><span class="p">(</span><span class="n">a</span> <span 
class="o">=&gt;</span> <span class="n">a</span><span class="p">.</span><span 
class="n">_2</span><span class="p">))</span>
+</pre></div>
+
+
 <p>All we need to do now is write the indicators.</p>
-<p><code>// zip a pair of arrays into an array of pairs, reattaching the 
action names
-val indicatorDescriptions = actions.map(a =&gt; a._1).zip(indicatorMatrices)
-writeIndicators(indicatorDescriptions)</code></p>
+<div class="codehilite"><pre><span class="c1">// zip a pair of arrays into an 
array of pairs, reattaching the action names</span>
+<span class="n">val</span> <span class="n">indicatorDescriptions</span> <span 
class="o">=</span> <span class="n">actions</span><span class="p">.</span><span 
class="n">map</span><span class="p">(</span><span class="n">a</span> <span 
class="o">=&gt;</span> <span class="n">a</span><span class="p">.</span><span 
class="n">_1</span><span class="p">).</span><span class="n">zip</span><span 
class="p">(</span><span class="n">indicatorMatrices</span><span 
class="p">)</span>
+</pre></div>
+
+
+<p><code>writeIndicators(indicatorDescriptions)</code></p>
 <p>The <code>writeIndicators</code> method uses the default write function 
<code>dfsWrite</code>.</p>
-<p>```
-/*<em>
- * Write indicatorMatrices to the output dir in the default format
- </em>/
-def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
-  for (indicator &lt;- indicators ) {
-    val indicatorDir = OutputPath + indicator._1
-    indicator._2.dfsWrite(
-      indicatorDir, // do we have to remove the last $ char?
-      // omit LLR strengths and format for search engine indexing
-      IndexedDatasetWriteBooleanSchema) 
-  }
-}</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="o">/**</span>
+ <span class="o">*</span> Write indicatorMatrices to the output dir in the 
default format
+ <span class="o">*/</span>
+def writeIndicators<span class="p">(</span> indicators: Array<span 
class="p">[(</span>String<span class="p">,</span> IndexedDataset<span 
class="p">)])</span> <span class="o">=</span> <span class="p">{</span>
+  <span class="kr">for</span> <span class="p">(</span>indicator <span 
class="o">&lt;-</span> indicators <span class="p">)</span> <span 
class="p">{</span>
+    val indicatorDir <span class="o">=</span> OutputPath <span 
class="o">+</span> indicator._1
+    indicator._2.dfsWrite<span class="p">(</span>
+      indicatorDir<span class="p">,</span> <span class="o">//</span> do we 
have to remove the last <span class="p">$</span> char?
+      <span class="o">//</span> omit LLR strengths and format for search 
engine indexing
+      IndexedDatasetWriteBooleanSchema<span class="p">)</span> 
+  <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
 <p>See the GitHub project for the full source. Now we create a build.sbt to 
build the example. </p>
-<p>```
-name := "cooccurrence-driver"</p>
-<p>organization := "com.finderbots"</p>
-<p>version := "0.1"</p>
-<p>scalaVersion := "2.10.4"</p>
-<p>val sparkVersion = "1.1.1"</p>
-<p>libraryDependencies ++= Seq(
-  "log4j" % "log4j" % "1.2.17",
-  // Mahout's Spark code
-  "commons-io" % "commons-io" % "2.4",
-  "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
-  "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
-  "org.apache.mahout" % "mahout-math" % "0.10.0",
-  "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
-  // Google collections, AKA Guava
-  "com.google.guava" % "guava" % "16.0")</p>
-<p>resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"</p>
-<p>resolvers += Resolver.mavenLocal</p>
-<p>packSettings</p>
-<p>packMain := Map(
-  "cooc" -&gt; "CooccurrenceDriver"
-)</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="n">name</span> <span 
class="p">:=</span> &quot;<span class="n">cooccurrence</span><span 
class="o">-</span><span class="n">driver</span>&quot;
+
+<span class="n">organization</span> <span class="p">:=</span> &quot;<span 
class="n">com</span><span class="p">.</span><span 
class="n">finderbots</span>&quot;
+
+<span class="n">version</span> <span class="p">:=</span> &quot;0<span 
class="p">.</span>1&quot;
+
+<span class="n">scalaVersion</span> <span class="p">:=</span> &quot;2<span 
class="p">.</span>10<span class="p">.</span>4&quot;
+
+<span class="n">val</span> <span class="n">sparkVersion</span> <span 
class="p">=</span> &quot;1<span class="p">.</span>1<span 
class="p">.</span>1&quot;
+
+<span class="n">libraryDependencies</span> <span class="o">++</span><span 
class="p">=</span> <span class="n">Seq</span><span class="p">(</span>
+  &quot;<span class="n">log4j</span>&quot; <span class="c">% &quot;log4j&quot; 
% &quot;1.2.17&quot;,</span>
+  <span class="o">//</span> <span class="n">Mahout</span><span 
class="o">&#39;</span><span class="n">s</span> <span class="n">Spark</span> 
<span class="n">code</span>
+  &quot;<span class="n">commons</span><span class="o">-</span><span 
class="n">io</span>&quot; <span class="c">% &quot;commons-io&quot; % 
&quot;2.4&quot;,</span>
+  &quot;<span class="n">org</span><span class="p">.</span><span 
class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span>&quot; <span class="c">% 
&quot;mahout-math-scala_2.10&quot; % &quot;0.10.0&quot;,</span>
+  &quot;<span class="n">org</span><span class="p">.</span><span 
class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span>&quot; <span class="c">% &quot;mahout-spark_2.10&quot; % 
&quot;0.10.0&quot;,</span>
+  &quot;<span class="n">org</span><span class="p">.</span><span 
class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span>&quot; <span class="c">% &quot;mahout-math&quot; % 
&quot;0.10.0&quot;,</span>
+  &quot;<span class="n">org</span><span class="p">.</span><span 
class="n">apache</span><span class="p">.</span><span 
class="n">mahout</span>&quot; <span class="c">% &quot;mahout-hdfs&quot; % 
&quot;0.10.0&quot;,</span>
+  <span class="o">//</span> <span class="n">Google</span> <span 
class="n">collections</span><span class="p">,</span> <span class="n">AKA</span> 
<span class="n">Guava</span>
+  &quot;<span class="n">com</span><span class="p">.</span><span 
class="n">google</span><span class="p">.</span><span 
class="n">guava</span>&quot; <span class="c">% &quot;guava&quot; % 
&quot;16.0&quot;)</span>
+
+<span class="n">resolvers</span> <span class="o">+</span><span 
class="p">=</span> &quot;<span class="n">typesafe</span> <span 
class="n">repo</span>&quot; <span class="n">at</span> &quot;<span 
class="n">http</span><span class="p">:</span><span class="o">//</span><span 
class="n">repo</span><span class="p">.</span><span 
class="n">typesafe</span><span class="p">.</span><span 
class="n">com</span><span class="o">/</span><span 
class="n">typesafe</span><span class="o">/</span><span 
class="n">releases</span><span class="o">/</span>&quot;
+
+<span class="n">resolvers</span> <span class="o">+</span><span 
class="p">=</span> <span class="n">Resolver</span><span class="p">.</span><span 
class="n">mavenLocal</span>
+
+<span class="n">packSettings</span>
+
+<span class="n">packMain</span> <span class="p">:=</span> <span 
class="n">Map</span><span class="p">(</span>
+  &quot;<span class="n">cooc</span>&quot; <span class="o">-&gt;</span> 
&quot;<span class="n">CooccurrenceDriver</span>&quot;<span class="p">)</span>
+</pre></div>
+
+
 <h2 id="build">Build</h2>
 <p>Build the examples from the project's root folder:</p>
 <div class="codehilite"><pre>$ <span class="n">sbt</span> <span 
class="n">pack</span>
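
The page in the diff above says all datasets are read before their row counts are adjusted, so that every matrix has one row per user seen in any action. That reasoning can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not Mahout code: Interactions is a hypothetical stand-in for IndexedDataset, an immutable Map stands in for the BiMap, and the input is in-memory (user, item) pairs rather than files.

```scala
object SharedUserDictionarySketch {

  // Hypothetical stand-in for IndexedDataset: row index -> item IDs, plus a declared row count.
  final case class Interactions(rows: Map[Int, Set[String]], numRows: Int)

  def readActions(
      actionInput: Seq[(String, Seq[(String, String)])]
  ): (Map[String, Int], Vector[(String, Interactions)]) = {
    // One user dictionary shared by all actions (user ID -> row index).
    var userDictionary = Map.empty[String, Int]

    // Pass 1: read every action, assigning each new user the next free row index.
    val read = actionInput.toVector.map { case (name, pairs) =>
      var rows = Map.empty[Int, Set[String]]
      for ((user, item) <- pairs) {
        val row = userDictionary.getOrElse(user, userDictionary.size)
        userDictionary += (user -> row)
        rows += (row -> (rows.getOrElse(row, Set.empty[String]) + item))
      }
      (name, rows)
    }

    // Pass 2: only now is the total user count known, so every action gets that many rows.
    val numUsers = userDictionary.size
    val resized = read.map { case (name, rows) => (name, Interactions(rows, numUsers)) }
    (userDictionary, resized)
  }
}
```

Given the purchase.csv rows from the page plus a hypothetical second action whose only new user is u5, the shared dictionary ends up with five users, and both results report five rows even though the purchase data alone mentions four.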

svn commit: r948531 - in /websites/staging/mahout/trunk/content: ./ users/environment/how-to-build-an-app.html
