how-to-build-an-app.mdtext

pat Mon, 20 Apr 2015 17:23:11 -0700

Author: pat
Date: Tue Apr 21 00:22:43 2015
New Revision: 1675009

URL: http://svn.apache.org/r1675009
Log:
CMS commit to mahout by pat


Modified:
    
mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext

Modified: 
mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext?rev=1675009&r1=1675008&r2=1675009&view=diff
==============================================================================
--- 
mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext
 (original)
+++ 
mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext
 Tue Apr 21 00:22:43 2015
@@ -14,144 +14,135 @@ In order to build and run the Cooccurren
 ##Application
 Using Mahout as a library in an application will require a little Scala code. 
We have an App trait in Scala so we'll create an object, which inherits from 
```App```
 
-```
-object CooccurrenceDriver extends App {
-}
-```
+
+    object CooccurrenceDriver extends App {
+    }
+    
+
 This will look a little different than Java since ```App``` does delayed 
initialization, which causes the main body to be executed when the App is 
launched, just as in Java you would create a CooccurrenceDriver.main.
 
 Before we can execute something on Spark we'll need to create a context. We 
could use raw Spark calls here but default values are set up for a Mahout 
context.
 
-```
-implicit val mc = mahoutSparkContext(masterUrl = "local", appName = 
"2-input-cooc")
-```
+
+    implicit val mc = mahoutSparkContext(masterUrl = "local", appName = 
"2-input-cooc")
+    
 We need to read in three files containing different interaction types. The 
files will each be read into a Mahout IndexedDataset. This allows us to 
preserve application-specific user and item IDs throughout the calculations.
 
 For example, here is data/purchase.csv:
 
-```
-u1,iphone
-u1,ipad
-u2,nexus
-u2,galaxy
-u3,surface
-u4,iphone
-u4,galaxy
 
-```
+    u1,iphone
+    u1,ipad
+    u2,nexus
+    u2,galaxy
+    u3,surface
+    u4,iphone
+    u4,galaxy
+
 Mahout has a helper function that reads the text delimited in 
SparkEngine.indexedDatasetDFSReadElements. The function reads single elements 
in a distributed way to create the IndexedDataset. 
 
 Notice we read in all datasets before we adjust the number of rows in them to 
match the total number of users in the data. This is so the math works out even 
if some users took one action but not another.
 
-```
-/**
- * Read files of element tuples and create IndexedDatasets one per action. 
These share a userID BiMap but have
- * their own itemID BiMaps
- */
-def readActions(actionInput: Array[(String, String)]): Array[(String, 
IndexedDataset)] = {
-  var actions = Array[(String, IndexedDataset)]()
-
-  val userDictionary: BiMap[String, Int] = HashBiMap.create()
-
-  // The first action named in the sequence is the "primary" action and 
-  // begins to fill up the user dictionary
-  for ( actionDescription <- actionInput ) {// grab the path to actions
-    val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
-      actionDescription._2,
-      schema = DefaultIndexedDatasetElementReadSchema,
-      existingRowIDs = userDictionary)
-    userDictionary.putAll(action.rowIDs)
-    // put the name in the tuple with the indexedDataset
-    actions = actions :+ (actionDescription._1, action) 
-  }
-
-  // After all actions are read in the userDictonary will contain every user 
seen, 
-  // even if they may not have taken all actions . Now we adjust the row rank 
of 
-  // all IndxedDataset's to have this number of rows
-  // Note: this is very important or the cooccurrence calc may fail
-  val numUsers = userDictionary.size() // one more than the cardinality
-
-  val resizedNameActionPairs = actions.map { a =>
-    //resize the matrix by, in effect by adding empty rows
-    val resizedMatrix = a._2.create(a._2.matrix, userDictionary, 
a._2.columnIDs).newRowCardinality(numUsers)
-    (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
-  }
-  resizedNameActionPairs // return the array of Tuples
-}
+    /**
+     * Read files of element tuples and create IndexedDatasets one per action. 
These share     a userID BiMap but have
+     * their own itemID BiMaps
+     */
+    def readActions(actionInput: Array[(String, String)]): Array[(String, 
IndexedDataset)] = {
+      var actions = Array[(String, IndexedDataset)]()
+
+      val userDictionary: BiMap[String, Int] = HashBiMap.create()
+
+      // The first action named in the sequence is the "primary" action and 
+      // begins to fill up the user dictionary
+      for ( actionDescription <- actionInput ) {// grab the path to actions
+        val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
+          actionDescription._2,
+          schema = DefaultIndexedDatasetElementReadSchema,
+          existingRowIDs = userDictionary)
+        userDictionary.putAll(action.rowIDs)
+        // put the name in the tuple with the indexedDataset
+        actions = actions :+ (actionDescription._1, action) 
+      }
+
+      // After all actions are read in the userDictonary will contain every 
user seen, 
+      // even if they may not have taken all actions . Now we adjust the row 
rank of 
+      // all IndxedDataset's to have this number of rows
+      // Note: this is very important or the cooccurrence calc may fail
+      val numUsers = userDictionary.size() // one more than the cardinality
+
+      val resizedNameActionPairs = actions.map { a =>
+        //resize the matrix by, in effect by adding empty rows
+        val resizedMatrix = a._2.create(a._2.matrix, userDictionary, 
a._2.columnIDs).newRowCardinality(numUsers)
+        (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
+      }
+      resizedNameActionPairs // return the array of Tuples
+    }
 
-```
 
 Now that we have the data read in we can perform the cooccurrence calculation.
 
-```
-// strip off names, which only takes and array of IndexedDatasets
-val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a => 
a._2))
 
-```
+    // strip off names, which only takes and array of IndexedDatasets
+    val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a 
=> a._2))
+
 
 All we need to do now is write the indicators.
 
-```
-// zip a pair of arrays into an array of pairs, reattaching the action names
-val indicatorDescriptions = actions.map(a => a._1).zip(indicatorMatrices)
+    // zip a pair of arrays into an array of pairs, reattaching the action 
names
+    val indicatorDescriptions = actions.map(a => a._1).zip(indicatorMatrices)
 writeIndicators(indicatorDescriptions)
-```
+
 
 The ```writeIndicators``` method uses the default write function 
```dfsWrite```.
 
-```
-/**
- * Write indicatorMatrices to the output dir in the default format
- */
-def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
-  for (indicator <- indicators ) {
-    val indicatorDir = OutputPath + indicator._1
-    indicator._2.dfsWrite(
-      indicatorDir, // do we have to remove the last $ char?
-      // omit LLR strengths and format for search engine indexing
-      IndexedDatasetWriteBooleanSchema) 
-  }
-}
+    /**
+     * Write indicatorMatrices to the output dir in the default format
+     */
+    def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
+      for (indicator <- indicators ) {
+        val indicatorDir = OutputPath + indicator._1
+        indicator._2.dfsWrite(
+          indicatorDir, // do we have to remove the last $ char?
+          // omit LLR strengths and format for search engine indexing
+          IndexedDatasetWriteBooleanSchema) 
+      }
+    }
  
-```
 
 See the Github project for the full source. Now we create a build.sbt to build 
the example. 
 
-```
-name := "cooccurrence-driver"
+    name := "cooccurrence-driver"
 
-organization := "com.finderbots"
+    organization := "com.finderbots"
 
-version := "0.1"
+    version := "0.1"
 
-scalaVersion := "2.10.4"
+    scalaVersion := "2.10.4"
 
-val sparkVersion = "1.1.1"
+    val sparkVersion = "1.1.1"
 
-libraryDependencies ++= Seq(
-  "log4j" % "log4j" % "1.2.17",
-  // Mahout's Spark code
-  "commons-io" % "commons-io" % "2.4",
-  "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
-  "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
-  "org.apache.mahout" % "mahout-math" % "0.10.0",
-  "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
-  // Google collections, AKA Guava
-  "com.google.guava" % "guava" % "16.0")
+    libraryDependencies ++= Seq(
+      "log4j" % "log4j" % "1.2.17",
+      // Mahout's Spark code
+      "commons-io" % "commons-io" % "2.4",
+      "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
+      "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
+      "org.apache.mahout" % "mahout-math" % "0.10.0",
+      "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
+      // Google collections, AKA Guava
+      "com.google.guava" % "guava" % "16.0")
 
-resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"
+    resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"
 
-resolvers += Resolver.mavenLocal
+    resolvers += Resolver.mavenLocal
 
-packSettings
+    packSettings
 
-packMain := Map(
-  "cooc" -> "CooccurrenceDriver"
-)
+    packMain := Map(
+      "cooc" -> "CooccurrenceDriver")
 
-```
 
 ##Build
 Building the examples from project's root folder:
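
Separately from the build step, the reasoning behind `readActions` in the diff above (every action shares one user dictionary, and every dataset is then resized to the total number of users) can be illustrated without Mahout or Spark. The following is only a sketch under stated assumptions: `ActionRows`, `readAction`, `resizeAll`, and the `view` data are hypothetical stand-ins invented for illustration, and a Scala `LinkedHashMap` is swapped in for the Guava `BiMap` used in the page. It is not the Mahout API.

```scala
import scala.collection.mutable

object SharedUserDictionarySketch {
  // Hypothetical stand-in for an IndexedDataset: (userIndex, itemID) pairs plus a row count.
  final case class ActionRows(name: String, rows: Seq[(Int, String)], numRows: Int)

  /** Index one action's (userID, itemID) elements against the shared user dictionary. */
  def readAction(name: String,
                 elements: Seq[(String, String)],
                 userDictionary: mutable.LinkedHashMap[String, Int]): ActionRows = {
    val rows = elements.map { case (user, item) =>
      // A user seen for the first time gets the next free row index.
      val idx = userDictionary.getOrElseUpdate(user, userDictionary.size)
      (idx, item)
    }
    // The row count only reflects the users known when this action was read.
    ActionRows(name, rows, userDictionary.size)
  }

  /** Give every action the same number of rows, in effect adding empty rows. */
  def resizeAll(actions: Seq[ActionRows], numUsers: Int): Seq[ActionRows] =
    actions.map(_.copy(numRows = numUsers))

  def main(args: Array[String]): Unit = {
    val userDictionary = mutable.LinkedHashMap.empty[String, Int]

    // data/purchase.csv as shown in the page
    val purchase = readAction("purchase", Seq(
      "u1" -> "iphone", "u1" -> "ipad", "u2" -> "nexus", "u2" -> "galaxy",
      "u3" -> "surface", "u4" -> "iphone", "u4" -> "galaxy"), userDictionary)

    // Hypothetical second action: u5 never purchased anything.
    val view = readAction("view", Seq("u1" -> "iphone", "u5" -> "nexus"), userDictionary)

    val resized = resizeAll(Seq(purchase, view), userDictionary.size)
    resized.foreach(a => println(s"${a.name}: ${a.numRows} rows"))
  }
}
```

In this sketch the primary action is read before `u5` has been seen, so it starts with fewer rows than the later action. Resizing after all reads is what makes the row counts agree, which is the point the page makes about users who took one action but not another.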

svn commit: r1675009 - /mahout/site/mahout_cms/trunk/content/users/environment/how-to-build-an-app.mdtext
