how-to-build-an-app.html

buildbot Mon, 20 Apr 2015 17:17:06 -0700

Author: buildbot
Date: Tue Apr 21 00:16:14 2015
New Revision: 948527

Log:
Staging update by buildbot for mahout


Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Apr 21 00:16:14 2015
@@ -1 +1 @@
-1674993
+1675008

Modified: 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
==============================================================================
--- 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
 (original)
+++ 
websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
 Tue Apr 21 00:16:14 2015
@@ -261,18 +261,118 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <h1 id="multiple-indicator-creation">Multiple Indicator Creation</h1>
-<p>This is an example of how to create more than two indicators from more than 
two user interaction types with Mahout. We will use very simple hand-created 
example data of the kind one might see in an ecommerce application. The application 
records three interactions for item-purchase, item-detail-view, and 
category-preference (search for or click on a category). </p>
-<p><em>spark-itemsimilarity</em> will handle two inputs but here we have three 
and rather than running <em>spark-itemsimilarity</em> twice we will create our 
own app to do it.</p>
+    <h1 id="how-to-create-and-app-using-mahout">How to create an App using 
Mahout</h1>
+<p>This is an example of how to create a simple app using Mahout as a library. 
The source is available on GitHub in the <a 
href="https://github.com/pferrel/3-input-cooc">3-input-cooc project</a> with 
more explanation about what it does. For this tutorial we'll concentrate on how 
to create an app.</p>
+<p>This example reads three interaction types and creates 
indicators for them using cooccurrence and cross-cooccurrence. The indicators 
will be written to text files in a format ready for search engine indexing in 
a recommender.</p>
 <h2 id="setup">Setup</h2>
 <p>In order to build and run the CooccurrenceDriver you need to install the 
following:</p>
 <ul>
 <li>Install the Java 7 JDK from Oracle. Mac users look here: <a 
href="http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html">Java
 SE Development Kit 7u72</a>.</li>
-<li>Install sbt (simple build tool) 0.13.x for <a 
href="Installing-sbt-on-Mac.html">Mac</a>, <a 
href="Installing-sbt-on-Windows.html">Windows</a>,
-<a href="Installing-sbt-on-Linux.html">Linux</a>,  or
-<a href="Manual-Installation.html">manual installation</a>.</li>
+<li>Install sbt (simple build tool) 0.13.x for <a 
href="http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Mac.html">Mac</a>, <a 
href="http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Linux.html">Linux</a>, 
or <a 
href="http://www.scala-sbt.org/release/tutorial/Manual-Installation.html">manual 
installation</a>.</li>
 <li>Install <a 
href="http://mahout.apache.org/general/downloads.html">Mahout</a>. Don't forget 
to set up MAHOUT_HOME and MAHOUT_LOCAL.</li>
 </ul>
+<h2 id="application">Application</h2>
+<p>Using Mahout as a library in an application will require a little Scala 
code. Scala provides an <code>App</code> trait, so we'll create an object that 
extends <code>App</code>:</p>
+<p><code>object CooccurrenceDriver extends App {
+}</code>
+This looks a little different from Java because <code>App</code> uses 
delayed initialization: the body of the object is executed when the app 
is launched, just as the body of a <code>CooccurrenceDriver.main</code> method would be in Java.</p>
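+<p>As a minimal sketch of that behavior, using a hypothetical <code>HelloDriver</code> object that is not part of the example project, the statements in the object body below run when the program starts, and <code>args</code> is supplied by the <code>App</code> trait:</p>
+<p>```scala
+// Hypothetical stand-in for CooccurrenceDriver; no Mahout or Spark required.
+object HelloDriver extends App {
+  // The object body plays the role of a Java main method.
+  println(s"HelloDriver launched with ${args.length} argument(s)")
+}
+```</p>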
+<p>Before we can execute something on Spark we'll need to create a context. We 
could use raw Spark calls here but default values are set up for a Mahout 
context.</p>
+<p><code>implicit val mc = mahoutSparkContext(masterUrl = "local", appName = 
"2-input-cooc")</code>
+We need to read in three files containing different interaction types. The 
files will each be read into a Mahout IndexedDataset. This allows us to 
preserve application-specific user and item IDs throughout the calculations.</p>
+<p>For example, here is data/purchase.csv:</p>
+<p>```
+u1,iphone
+u1,ipad
+u2,nexus
+u2,galaxy
+u3,surface
+u4,iphone
+u4,galaxy</p>
+<p>```
+Mahout has a helper function, SparkEngine.indexedDatasetDFSReadElements, that 
reads delimited text. The function reads single elements 
in a distributed way to create the IndexedDataset. </p>
+<p>Notice we read in all datasets before we adjust the number of rows in them 
to match the total number of users in the data. This is so the math works out 
even if some users took one action but not another.</p>
+<p>```
+/**
+ * Read files of element tuples and create IndexedDatasets one per action. 
These share a userID BiMap but have
+ * their own itemID BiMaps
+ */
+def readActions(actionInput: Array[(String, String)]): Array[(String, 
IndexedDataset)] = {
+  var actions = Array[(String, IndexedDataset)]()</p>
+<p>val userDictionary: BiMap[String, Int] = HashBiMap.create()</p>
+<p>// The first action named in the sequence is the "primary" action and 
+  // begins to fill up the user dictionary
+  for ( actionDescription &lt;- actionInput ) {// grab the path to actions
+    val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
+      actionDescription._2,
+      schema = DefaultIndexedDatasetElementReadSchema,
+      existingRowIDs = userDictionary)
+    userDictionary.putAll(action.rowIDs)
+    // put the name in the tuple with the indexedDataset
+    actions = actions :+ (actionDescription._1, action) 
+  }</p>
+<p>// After all actions are read in the userDictionary will contain every user 
seen, 
+  // even if they may not have taken all actions. Now we adjust the row rank 
of 
+  // all IndexedDatasets to have this number of rows
+  // Note: this is very important or the cooccurrence calc may fail
+  val numUsers = userDictionary.size() // total number of users seen</p>
+<p>val resizedNameActionPairs = actions.map { a =&gt;
+    // resize the matrix, in effect by adding empty rows
+    val resizedMatrix = a._2.create(a._2.matrix, userDictionary, 
a._2.columnIDs).newRowCardinality(numUsers)
+    (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
+  }
+  resizedNameActionPairs // return the array of Tuples
+}</p>
+<p>```</p>
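+<p>The effect of sharing one user dictionary across actions can be sketched in plain Scala, without Mahout. Everything below is a hypothetical illustration: an immutable <code>Map</code> stands in for the <code>BiMap</code>, and <code>SharedUserDictionary</code> and <code>extend</code> are made-up names. It shows why the final dictionary size, not the size seen after any single action, determines the row count:</p>
+<p>```scala
+// Plain-Scala illustration only; Mahout's IndexedDataset and BiMap are not used.
+object SharedUserDictionary {
+  // Tiny hand-made (user, item) pairs for two actions.
+  val purchases = Seq(("u1", "iphone"), ("u2", "nexus"))
+  val views = Seq(("u1", "ipad"), ("u3", "surface"))
+
+  // Give each user not yet in the dictionary the next free row index.
+  def extend(dict: Map[String, Int], action: Seq[(String, String)]): Map[String, Int] = {
+    val newUsers = action.map(_._1).distinct.filterNot(dict.contains)
+    dict ++ newUsers.zip(dict.size until dict.size + newUsers.size)
+  }
+
+  val afterPurchases = extend(Map.empty[String, Int], purchases)
+  val afterViews = extend(afterPurchases, views)
+
+  // Both action matrices need this many rows, although u3 never purchased.
+  val numUsers = afterViews.size
+
+  def main(args: Array[String]): Unit =
+    println(s"rows after purchases: ${afterPurchases.size}, rows needed: $numUsers")
+}
+```</p>
+<p>In this sketch the purchase data alone yields two rows, while the shared dictionary ends with three, which is the kind of mismatch the resizing step in <code>readActions</code> removes.</p>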
+<p>Now that we have the data read in we can perform the cooccurrence 
calculation.</p>
+<p>```
+// strip off the names, since cooccurrencesIDSs only takes an array of IndexedDatasets
+val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a 
=&gt; a._2))</p>
+<p>```</p>
+<p>All we need to do now is write the indicators.</p>
+<p><code>// zip a pair of arrays into an array of pairs, reattaching the 
action names
+val indicatorDescriptions = actions.map(a =&gt; a._1).zip(indicatorMatrices)
+writeIndicators(indicatorDescriptions)</code></p>
+<p>The <code>writeIndicators</code> method uses the default write function 
<code>dfsWrite</code>.</p>
+<p>```
+/**
+ * Write indicatorMatrices to the output dir in the default format
+ */
+def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
+  for (indicator &lt;- indicators ) {
+    val indicatorDir = OutputPath + indicator._1
+    indicator._2.dfsWrite(
+      indicatorDir, // do we have to remove the last $ char?
+      // omit LLR strengths and format for search engine indexing
+      IndexedDatasetWriteBooleanSchema) 
+  }
+}</p>
+<p>```</p>
+<p>See the Github project for the full source. Now we create a build.sbt to 
build the example. </p>
+<p>```
+name := "cooccurrence-driver"</p>
+<p>organization := "com.finderbots"</p>
+<p>version := "0.1"</p>
+<p>scalaVersion := "2.10.4"</p>
+<p>val sparkVersion = "1.1.1"</p>
+<p>libraryDependencies ++= Seq(
+  "log4j" % "log4j" % "1.2.17",
+  // Mahout's Spark code
+  "commons-io" % "commons-io" % "2.4",
+  "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
+  "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
+  "org.apache.mahout" % "mahout-math" % "0.10.0",
+  "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
+  // Google collections, AKA Guava
+  "com.google.guava" % "guava" % "16.0")</p>
+<p>resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"</p>
+<p>resolvers += Resolver.mavenLocal</p>
+<p>packSettings</p>
+<p>packMain := Map(
+  "cooc" -&gt; "CooccurrenceDriver"
+)</p>
+<p>```</p>
 <h2 id="build">Build</h2>
 <p>Build the examples from the project's root folder:</p>
 <div class="codehilite"><pre>$ <span class="n">sbt</span> <span 
class="n">pack</span>
@@ -284,23 +384,7 @@
 </pre></div>
 
 
-<p>The driver will execute in Spark standalone mode on the provided sample 
data and output log information including various information about the input 
data. The output will be in 
/path/to/3-input-cooc/data/indicators/<em>indicator-type</em></p>
-<h2 id="cooccurrencedriver">CooccurrenceDriver</h2>
-<p>This driver takes three actions in three separate input files. The input is 
in tuple form (user-id,item-id) one per line. It calculates all cooccurrence 
and cross-cooccurrence indicators. The sample actions are trivial hand made 
examples with somewhat intuitive data.</p>
-<p>Actions:</p>
-<ol>
-<li><strong>Purchase</strong>: user purchases</li>
-<li><strong>View</strong>: user product details views</li>
-<li><strong>Category</strong>: user preference for category tags</li>
-</ol>
-<p>Indicators:</p>
-<ol>
-<li><strong>Purchase cooccurrence</strong>: may be interpreted as a list of 
similar items for each item. Similar in terms of which users purchased 
them.</li>
-<li><strong>View cross-cooccurrence</strong>: may be interpreted as a list of 
similar items in terms of which users viewed the item where the view led to a 
purchase.</li>
-<li><strong>Category cross-cooccurrence</strong>: may be interpreted as a 
list of similar categories in terms of which users preferred the category and 
this led to a purchase.</li>
-</ol>
-<h2 id="data">Data</h2>
-<p>Mahout has reader traits that will read text delimited files. Input for 
<em>spark-itemsimilarity</em> and this CooccurrenceDriver are tuples of 
(user-id,item-id) with one line per tuple. The inputs for CooccurrenceDriver 
are files but in <em>spark-itemsimilarity</em> they may be directories of 
"part-xxxxx" files. These can be found in the <code>data</code> directory.</p>
+<p>The driver will execute in Spark standalone mode and put the data in 
/path/to/3-input-cooc/data/indicators/<em>indicator-type</em></p>
 <h2 id="using-a-debugger">Using a Debugger</h2>
 <p>To build and run this example in a debugger like IntelliJ IDEA, install 
IDEA from the IntelliJ site and add the Scala plugin.</p>
 <p>Open IDEA and go to the menu File-&gt;New-&gt;Project from existing 
sources-&gt;SBT-&gt;/path/to/3-input-cooc. This will create an IDEA project 
from <code>build.sbt</code> in the root directory.</p>
@@ -309,6 +393,9 @@
 <p>Now choose "Application" in the left pane and hit the plus sign "+". Give 
the config a name and hit the ellipsis button to the right of the "Main class" 
field as shown.</p>
 <p><img alt="image" src="http://mahout.apache.org/images/debug-config-2.png" 
title="=600x" /></p>
 <p>After setting breakpoints you are now ready to debug the configuration. Go 
to the Run-&gt;Debug... menu and pick your configuration. This will execute 
using a local standalone instance of Spark.</p>
+<h2 id="the-mahout-shell">The Mahout Shell</h2>
+<p>For small script-like apps you may wish to use the Mahout shell. It is a 
Scala REPL-style interactive shell built on the Spark shell with Mahout-Samsara 
extensions.</p>
+<p>For the shell you won't need the context, since it is created when the 
shell is launched. To control the configuration of Mahout and Spark we set 
environment variables. </p>
    </div>
   </div>     
 </div>

svn commit: r948527 - in /websites/staging/mahout/trunk/content: ./ users/environment/how-to-build-an-app.html
