WEBSITE Adding Docs
Project: http://git-wip-us.apache.org/repos/asf/mahout/repo Commit: http://git-wip-us.apache.org/repos/asf/mahout/commit/0b38f516 Tree: http://git-wip-us.apache.org/repos/asf/mahout/tree/0b38f516 Diff: http://git-wip-us.apache.org/repos/asf/mahout/diff/0b38f516 Branch: refs/heads/master Commit: 0b38f5167297b3654f1f5fa5e79ce6e8d086ae53 Parents: c8bdf2e Author: rawkintrevo <[email protected]> Authored: Wed May 3 00:08:20 2017 -0500 Committer: rawkintrevo <[email protected]> Committed: Wed May 3 00:08:20 2017 -0500 ---------------------------------------------------------------------- website/docs/_includes/algo_navbar.html | 23 +- website/docs/_includes/mr_tutorial_navbar.html | 9 + website/docs/_includes/navbar.html | 24 +- website/docs/algorithms/linear-algebra/index.md | 16 ++ .../algorithms/regression/fittness-tests.md | 17 ++ website/docs/algorithms/regression/index.md | 21 ++ website/docs/algorithms/regression/ols.md | 61 +++- .../serial-correlation/cochrane-orcutt.md | 11 +- .../regression/serial-correlation/dw-test.md | 18 ++ .../docs/tutorials/cco-lastfm/cco-lastfm.scala | 58 ++-- .../docs/tutorials/mahout-in-zeppelin/index.md | 276 +++++++++++++++++++ .../tutorials/mahout-in-zeppelin/zeppelin1.png | Bin 0 -> 50936 bytes .../tutorials/mahout-in-zeppelin/zeppelin2.png | Bin 0 -> 46906 bytes .../tutorials/mahout-in-zeppelin/zeppelin3.png | Bin 0 -> 68551 bytes website/front/community/blogs.md | 5 + 15 files changed, 488 insertions(+), 51 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/_includes/algo_navbar.html ---------------------------------------------------------------------- diff --git a/website/docs/_includes/algo_navbar.html b/website/docs/_includes/algo_navbar.html index 831f48d..92fd4c8 100644 --- a/website/docs/_includes/algo_navbar.html +++ b/website/docs/_includes/algo_navbar.html @@ -1,6 +1,7 @@ <div id="AlgoMenu"> + 
<span><b>Mahout-Samsara Algorithms</b></span> <div class="list-group panel"> - <a href="#linalg" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#AlgoMenu"><b>Linear Algebra Algorithms</b><i class="fa fa-caret-down"></i></a> + <a href="#linalg" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#AlgoMenu"><b>Linear Algebra</b><i class="fa fa-caret-down"></i></a> <div class="collapse" id="linalg"> <ul class="nav sidebar-nav"> <li> <a href="{{ BASE_PATH }}/algorithms/linear-algebra/d-als.html">Distributed Alternating Least Squares</a></li> @@ -17,5 +18,25 @@ <li> <a href="{{ BASE_PATH }}/algorithms/preprocessors/MeanCenter.html">MeanCenter</a></li> </ul> </div> + <a href="#regression" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#AlgoMenu"><b>Regression</b><i class="fa fa-caret-down"></i></a> + <div class="collapse" id="regression"> + <ul class="nav sidebar-nav"> + <a href="#serial-correlation" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#regression"><b>• Serial Correlation</b><i class="fa fa-caret-down"></i></a> + <div class="collapse" id="serial-correlation"> + <ul class="nav sidebar-nav"> + <li> <a href="{{ BASE_PATH }}/algorithms/regression/serial-correlation/cochrane-orcutt.html">Cochrane-Orcutt Procedure</a></li> + <li> <a href="{{ BASE_PATH }}/algorithms/regression/serial-correlation/dw-test.html">Durbin Watson Test</a></li> + </ul> + </div> + <li> <a href="{{ BASE_PATH }}/algorithms/regression/ols.html">Ordinary Least Squares (Closed Form)</a></li> + <li> <a href="{{ BASE_PATH }}/algorithms/regression/fittness-tests.html">Fitness Tests</a></li> + </ul> + </div> + <a href="#reccomenders" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#AlgoMenu"><b>Reccomenders</b><i class="fa fa-caret-down"></i></a> + <div class="collapse" id="reccomenders"> + <ul class="nav sidebar-nav"> + <li> 
<a href="{{ BASE_PATH }}/algorithms/reccomenders">Reccomender Overview</a></li> + </ul> + </div> </div> </div> http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/_includes/mr_tutorial_navbar.html ---------------------------------------------------------------------- diff --git a/website/docs/_includes/mr_tutorial_navbar.html b/website/docs/_includes/mr_tutorial_navbar.html index b9f0140..cd268f5 100644 --- a/website/docs/_includes/mr_tutorial_navbar.html +++ b/website/docs/_includes/mr_tutorial_navbar.html @@ -8,6 +8,7 @@ <li> <a href="{{ BASE_PATH }}/tutorials/map-reduce/classification/breiman-example.html">Breiman Example</a></li> <li> <a href="{{ BASE_PATH }}/tutorials/map-reduce/classification/twenty-newsgroups.html">Twenty Newsgroups Example</a></li> <li> <a href="{{ BASE_PATH }}/tutorials/map-reduce/classification/wikipedia-classifier-example.html">Wikipedia Classifier Example</a></li> + <li> <a href="{{ BASE_PATH }}/tutorials/map-reduce/classification/parallel-frequent-pattern-mining.html">Parallel Frequent Pattern Mining</a></li> </ul> </div> <a href="#clustering" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#MrTutorialMenu"><b>Clustering</b><i class="fa fa-caret-down"></i></a> @@ -25,5 +26,13 @@ <li> <a href="{{ BASE_PATH }}/tutorials/map-reduce/clustering/visualizing-sample-clusters.html">Visualizing Sample Clusters</a></li> </ul> </div> + <a href="#misc" class="list-group-item list-group-item-success" data-toggle="collapse" data-parent="#MrTutorialMenu"><b>Miscelaneous</b><i class="fa fa-caret-down"></i></a> + <div class="collapse" id="misc"> + <ul class="nav sidebar-nav"> + <li> <a href="{{ BASE_PATH }}/tutorials/map-reduce/misc/mr---map-reduce.html">MR Map-Reduce</a></li> + <li> <a href="{{ BASE_PATH }}/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.html">Parallel Frequent Pattern Mining</a></li> + <li> <a href="{{ BASE_PATH 
}}/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.html">Using Mahout (Map Reduce) with Python via Jpype</a></li> + </ul> + </div> </div> </div> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/_includes/navbar.html ---------------------------------------------------------------------- diff --git a/website/docs/_includes/navbar.html b/website/docs/_includes/navbar.html index df2c6c5..1d398e3 100644 --- a/website/docs/_includes/navbar.html +++ b/website/docs/_includes/navbar.html @@ -53,25 +53,21 @@ <li id="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Algorithms<span class="caret"></span></a> <ul class="dropdown-menu"> - <li><span><b> Distributed Linear Algebra</b><span></li> - <li><a href="{{ BASE_PATH}}/algorithms/linear-algebra/d-ssvd.html">Distributed SVD</a></li> - <li><a href="{{ BASE_PATH}}/algorithms/linear-algebra/d-spca.html">Distributed PCA</a></li> - <li><a href="{{ BASE_PATH}}/algorithms/linear-algebra/d-qr.html">Distributed QR-Decomposition </a></li> - <li><a href="{{ BASE_PATH}}/algorithms/linear-algebra/d-als.html">Distributed ALS</a></li> + <li><a href="{{ BASE_PATH}}/algorithms/linear-algebra">Distributed Linear Algebra</a></li> + <li><a href="{{ BASE_PATH}}/algorithms/preprocessors">Preprocessors</a></li> + <li><a href="{{ BASE_PATH}}/algorithms/regression">Regression</a></li> + <li><a href="{{ BASE_PATH}}/algorithms/reccomenders">Reccomenders</a></li> <li role="separator" class="divider"></li> - <li><span><b> Regression </b><span></li> - <li><a href="{{ BASE_PATH}}/algorithms/regression/ols.html">Ordinary Least Squares</a></li> - <li><a href="{{ BASE_PATH}}/algorithms/regression/cochrane-orcutt.html">Cochrane-Orcutt Procedure</a></li> - <li role="separator" class="divider"></li> - <li><span><b> Preprocessors</b><span></li> - <li><a href="{{ BASE_PATH}}/algorithms/preprocessors/AsFactor.html">AsFactor 
(a.k.a. "one-hot encoding")</a></li> - <li role="separator" class="divider"></li> - <li><span><b> Reccomenders</b><span></li> + <li><a href="{{ BASE_PATH }}/algorithms/map-reduce">MapReduce <i>(deprecated)</i></a></li> + </ul> <!--<li><a href="{{ BASE_PATH}}/algorithms/reccomenders/recommender-overview.html">Reccomender Overview</a></li> Do we still need? seems like short version of next post--> + <!-- <li><a href="{{ BASE_PATH}}/algorithms/reccomenders/intro-cooccurrence-spark.html">Intro to Coocurrence With Spark</a></li> <li role="separator" class="divider"></li> <li><span> <a href="{{ BASE_PATH }}/algorithms/map-reduce"><b>MapReduce</b> (deprecated)</a><span></li> - </ul> + + + --> </li> <!-- Scala Docs --> http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/algorithms/linear-algebra/index.md ---------------------------------------------------------------------- diff --git a/website/docs/algorithms/linear-algebra/index.md b/website/docs/algorithms/linear-algebra/index.md index e69de29..e42978a 100644 --- a/website/docs/algorithms/linear-algebra/index.md +++ b/website/docs/algorithms/linear-algebra/index.md @@ -0,0 +1,16 @@ +--- +layout: algorithm + +title: Distributed Linear Algebra +theme: + name: retro-mahout +--- + +Mahout has a number of distributed linear algebra "algorithms" that, in concert with the mathematically expressive R-Like Scala DSL, make it possible for users to quickly "roll their own" distributed algorithms. 
+ +[Distributed QR Decomposition](d-qr.html) + +[Distributed Stochastic Principal Component Analysis](d-spca.html) + +[Distributed Stochastic Singular Value Decomposition](d-ssvd.html) + http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/algorithms/regression/fittness-tests.md ---------------------------------------------------------------------- diff --git a/website/docs/algorithms/regression/fittness-tests.md b/website/docs/algorithms/regression/fittness-tests.md index e69de29..1bd8984 100644 --- a/website/docs/algorithms/regression/fittness-tests.md +++ b/website/docs/algorithms/regression/fittness-tests.md @@ -0,0 +1,17 @@ +--- +layout: algorithm +title: Regression Fitness Tests +theme: + name: mahout2 +--- + +TODO: Fill this out! +Stub + +### About + +### Parameters + +### Example + + http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/algorithms/regression/index.md ---------------------------------------------------------------------- diff --git a/website/docs/algorithms/regression/index.md b/website/docs/algorithms/regression/index.md index e69de29..3639c4f 100644 --- a/website/docs/algorithms/regression/index.md +++ b/website/docs/algorithms/regression/index.md @@ -0,0 +1,21 @@ +--- +layout: algorithm + +title: Regression Algorithms +theme: + name: retro-mahout +--- + +Apache Mahout implements the following regression algorithms "off the shelf". 
+ +### Closed Form Solutions + +These methods use closed-form solutions (not stochastic) to solve regression problems. + +[Ordinary Least Squares](ols.html) + +### Autocorrelation Regression + +Serial correlation of the error terms can lead to biased estimates of regression parameters; the following remedial procedures are provided: + +[Cochrane Orcutt Procedure](cochrane-orcutt.html) http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/algorithms/regression/ols.md ---------------------------------------------------------------------- diff --git a/website/docs/algorithms/regression/ols.md b/website/docs/algorithms/regression/ols.md index 5c16d1f..58c74c5 100644 --- a/website/docs/algorithms/regression/ols.md +++ b/website/docs/algorithms/regression/ols.md @@ -5,5 +5,62 @@ theme: name: mahout2 --- -TODO: Fill this out! -Stub \ No newline at end of file +### About + +The `OrdinaryLeastSquares` regressor in Mahout implements a _closed-form_ solution to [Ordinary Least Squares](https://en.wikipedia.org/wiki/Ordinary_least_squares). +This is in stark contrast to many "big data machine learning" frameworks which implement a _stochastic_ approach. From the user's perspective this difference can be reduced to: + +- **_Stochastic_**- A series of guesses at a line of best fit. +- **_Closed Form_**- The mathematical approach has been thoroughly explored, the properties of the parameters are well understood, and the problems which may arise (and their remedial measures) are known. This is usually the preferred choice of mathematicians/statisticians, but computational limitations have forced us to resort to SGD. 
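For reference, the contrast drawn above can be written out. This is a sketch under textbook assumptions, not something taken from the Mahout source: it assumes \(\mathbf{X}\) has full column rank (so \(\mathbf{X}^\top\mathbf{X}\) is invertible), and the symbols \(\mathbf{y}\) (the target column), \(\eta\) (a step size) and \((\mathbf{x}_i, y_i)\) (a single sampled row) are introduced here for illustration only. The commit does not say how `OrdinaryLeastSquares` evaluates the expression internally.

```latex
% Closed form: the minimizer of the squared error, obtained in one step
\hat{\boldsymbol{\beta}}
  = \arg\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\beta} \rVert_2^2
  = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}

% Stochastic (SGD): repeated small corrections from one sampled row at a time
\boldsymbol{\beta}_{t+1}
  = \boldsymbol{\beta}_t + \eta \, \mathbf{x}_i \left( y_i - \mathbf{x}_i^\top \boldsymbol{\beta}_t \right)
```

The first expression is what makes quantities such as standard errors tractable, since \((\mathbf{X}^\top \mathbf{X})^{-1}\) is available explicitly; the second only ever produces a sequence of approximations.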
+ +### Parameters + +<div class="table-striped"> + <table class="table"> + <tr> + <th>Parameter</th> + <th>Description</th> + <th>Default Value</th> + </tr> + <tr> + <td><code>'calcCommonStatistics</code></td> + <td>Calculate common statistics such as Coefficient of Determination and Mean Square Error</td> + <td><code>true</code></td> + </tr> + <tr> + <td><code>'calcStandardErrors</code></td> + <td>Calculate the standard errors (and subsequent "t-scores" and "p-values") of the \(\boldsymbol{\beta}\) estimates</td> + <td><code>true</code></td> + </tr> + <tr> + <td><code>'addIntercept</code></td> + <td>Add an intercept to \(\mathbf{X}\)</td> + <td><code>true</code></td> + </tr> + </table> +</div> + +### Example + +In this example we disable the "calculate common statistics" parameter, so our summary will NOT contain the coefficient of determination (R-squared) or Mean Square Error. +```scala +import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares + +val drmData = drmParallelize(dense( + (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios + (1, 2, 12, 12, 18.042851), // Cap'n'Crunch + (1, 1, 12, 13, 22.736446), // Cocoa Puffs + (2, 1, 11, 13, 32.207582), // Froot Loops + (1, 2, 12, 11, 21.871292), // Honey Graham Ohs + (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold + (6, 2, 17, 1, 50.764999), // Cheerios + (3, 2, 13, 7, 40.400208), // Clusters + (3, 3, 13, 4, 45.811716)), numPartitions = 2) + + +val drmX = drmData(::, 0 until 4) +val drmY = drmData(::, 4 until 5) + +val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY, 'calcCommonStatistics → false) +println(model.summary) +``` http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md ---------------------------------------------------------------------- diff --git a/website/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md b/website/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md 
index a88a0a0..78bff65 100644 --- a/website/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md +++ b/website/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md @@ -4,6 +4,13 @@ title: Cochrane-Orcutt Procedure theme: name: mahout2 --- - TODO: Fill this out! -Stub \ No newline at end of file +Stub + +### About + +### Parameters + +### Example + + http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/algorithms/regression/serial-correlation/dw-test.md ---------------------------------------------------------------------- diff --git a/website/docs/algorithms/regression/serial-correlation/dw-test.md b/website/docs/algorithms/regression/serial-correlation/dw-test.md index e69de29..64ca831 100644 --- a/website/docs/algorithms/regression/serial-correlation/dw-test.md +++ b/website/docs/algorithms/regression/serial-correlation/dw-test.md @@ -0,0 +1,18 @@ +--- +layout: algorithm +title: Durbin-Watson Test +theme: + name: mahout2 +--- + +stub +TODO: Fill this out! +Stub + +### About + +### Parameters + +### Example + + http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/tutorials/cco-lastfm/cco-lastfm.scala ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/cco-lastfm/cco-lastfm.scala b/website/docs/tutorials/cco-lastfm/cco-lastfm.scala index dc99d57..6ba46a9 100644 --- a/website/docs/tutorials/cco-lastfm/cco-lastfm.scala +++ b/website/docs/tutorials/cco-lastfm/cco-lastfm.scala @@ -1,41 +1,35 @@ - -/** - * Created by rawkintrevo on 4/5/17. 
- */ - -// Only need these to intelliJ doesn't whine - -import org.apache.mahout.drivers.ItemSimilarityDriver.parser -import org.apache.mahout.math._ -import org.apache.mahout.math.scalabindings._ -import org.apache.mahout.math.drm._ -import org.apache.mahout.math.scalabindings.RLikeOps._ -import org.apache.mahout.math.drm.RLikeDrmOps._ -import org.apache.mahout.sparkbindings._ -import org.apache.spark.SparkContext -import org.apache.spark.SparkContext._ -import org.apache.spark.SparkConf -val conf = new SparkConf().setAppName("Simple Application") -val sc = new SparkContext(conf) - -implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc) - - -// </pandering to intellij> - -// http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip -// start mahout shell like this: $MAHOUT_HOME/bin/mahout spark-shell +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. +*/ + +/* + * Download data from: http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip + * then run this in the mahout shell. 
+ */ import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark // We need to turn our raw text files into RDD[(String, String)] -val userTagsRDD = sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/user_taggedartists.dat").map(line => line.split("\t")).map(a => (a(0), a(2))).filter(_._1 != "userID") +val userTagsRDD = sc.textFile("/path/to/lastfm/user_taggedartists.dat").map(line => line.split("\t")).map(a => (a(0), a(2))).filter(_._1 != "userID") val userTagsIDS = IndexedDatasetSpark.apply(userTagsRDD)(sc) -val userArtistsRDD = sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/user_artists.dat").map(line => line.split("\t")).map(a => (a(0), a(1))).filter(_._1 != "userID") +val userArtistsRDD = sc.textFile("/path/to/lastfm/user_artists.dat").map(line => line.split("\t")).map(a => (a(0), a(1))).filter(_._1 != "userID") val userArtistsIDS = IndexedDatasetSpark.apply(userArtistsRDD)(sc) -val userFriendsRDD = sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/user_friends.dat").map(line => line.split("\t")).map(a => (a(0), a(1))).filter(_._1 != "userID") +val userFriendsRDD = sc.textFile("/path/to/data/lastfm/user_friends.dat").map(line => line.split("\t")).map(a => (a(0), a(1))).filter(_._1 != "userID") val userFriendsIDS = IndexedDatasetSpark.apply(userFriendsRDD)(sc) import org.apache.mahout.math.cf.SimilarityAnalysis @@ -44,8 +38,8 @@ val artistReccosLlrDrmListByArtist = SimilarityAnalysis.cooccurrencesIDSs(Array( // Anonymous User -val artistMap = sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/artists.dat").map(line => line.split("\t")).map(a => (a(1), a(0))).filter(_._1 != "name").collect.toMap -val tagsMap = sc.textFile("/home/rawkintrevo/gits/MahoutExamples/data/lastfm/tags.dat").map(line => line.split("\t")).map(a => (a(1), a(0))).filter(_._1 != "tagValue").collect.toMap +val artistMap = sc.textFile("/path/to/lastfm/artists.dat").map(line => line.split("\t")).map(a => (a(1), 
a(0))).filter(_._1 != "name").collect.toMap +val tagsMap = sc.textFile("/path/to/lastfm/tags.dat").map(line => line.split("\t")).map(a => (a(1), a(0))).filter(_._1 != "tagValue").collect.toMap // Watch your skin- you're not wearing armour. (This will fail on misspelled artists // This is neccessary because the ids are integer-strings already, and for this demo I didn't want to chance them to Integer types (bc more often you'll have strings). http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/tutorials/mahout-in-zeppelin/index.md ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/mahout-in-zeppelin/index.md b/website/docs/tutorials/mahout-in-zeppelin/index.md index e69de29..6362c30 100644 --- a/website/docs/tutorials/mahout-in-zeppelin/index.md +++ b/website/docs/tutorials/mahout-in-zeppelin/index.md @@ -0,0 +1,276 @@ +--- +layout: tutorial +title: Visualizing Mahout in Zeppelin +theme: + name: mahout2 +--- + + +[Apache Zeppelin](http://zeppelin.apache.org) is an exciting notebooking tool, designed for working with Big Data +applications. It comes with great integration for graphing in R and Python, supports multiple languages in a single +notebook (and facilitates sharing of variables between interpreters), and makes working with Spark and Flink in an interactive environment (either locally or in cluster mode) a +breeze. Of course, it does lots of other cool things too- but those are the features we're going to take advantage of. + +### Step1: Download and Install Zeppelin + +Zeppelin binaries by default use Spark 2.1 / Scala 2.11. Until Mahout puts out Spark 2.1/Scala 2.11 binaries, you have +two options. 
+ +#### Option 1: Build Mahout for Spark 2.1/Scala 2.11 + +**Build Mahout** + +Follow the standard procedures for building Mahout, except manually set the Spark and Scala versions - the easiest way being: + + git clone http://github.com/apache/mahout + cd mahout + mvn clean package -Dspark.version=2.1.0 -Dscala.version=2.11.8 -Dscala.compat.version=2.11 -DskipTests + + +**Download Zeppelin** + + cd /a/good/place/to/install/ + wget http://apache.mirrors.tds.net/zeppelin/zeppelin-0.7.1/zeppelin-0.7.1-bin-all.tgz + tar -xzf zeppelin-0.7.1-bin-all.tgz + cd zeppelin* + bin/zeppelin-daemon.sh start + +And that's it. Open a web browser and surf to [http://localhost:8080](http://localhost:8080) + +Proceed to Step 2. + +#### Option2: Build Zeppelin for Spark 1.6/Scala 2.10 + +We'll use Mahout binaries from Maven, so all you need to do is clone, and build Zeppelin- + + git clone http://github.com/apache/zeppelin + cd zeppelin + mvn clean package -Pspark1.6 -Pscala2.10 -DskipTests + +After it builds successfully... + + bin/zeppelin-daemon.sh start + +And that's it. Open a web browser and surf to [http://localhost:8080](http://localhost:8080) + +### Step2: Create the Mahout Spark Interpreter + +After opening your web browser and surfing to [http://localhost:8080](http://localhost:8080), click on the `Anonymous` +button on the top right corner, which will open a drop down. Then click `Interpreter`. + + + +At the top right, just below the blue nav bar- you will see two buttons, "Repository" and "+Create". Click on "+Create" + +The following screen should appear. + + + +In the **Interpreter Name** enter `mahoutSpark` (you can name it whatever you like, but this is what we'll assume you've +named it later in the tutorial) + +In the **Interpreter group** drop down, select `spark`. A bunch of other settings will now auto-populate. + +Scroll to the bottom of the **Properties** list. In the last row, you'll see two blank boxes. 
+ +Add the following properties by clicking the "+" button to the right. + +<div class="table-striped"> +<table class="table"> + <tr> + <th>name</th> + <th>value</th> + </tr> + <tr> + <td>spark.kryo.referenceTracking</td> + <td>false</td> + </tr> + <tr> + <td>spark.kryo.registrator</td> + <td>org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator</td> + </tr> + <tr> + <td>spark.kryoserializer.buffer</td> + <td>32</td> + </tr> + <tr> + <td>spark.kryoserializer.buffer.max</td> + <td>600m</td> + </tr> + <tr> + <td>spark.serializer</td> + <td>org.apache.spark.serializer.KryoSerializer</td> + </tr> +</table> +</div> + +### Step 3: Add Dependencies +You'll also need to add the following **Dependencies**. + +#### If you chose Option1 in Step 1: + +Where `/path/to/mahout` is the path to the directory where you've built Mahout. + +<div class="table-striped"> +<table class="table"> + <tr> + <th>artifact</th> + <th>exclude</th> + </tr> + <tr> + <td>/path/to/mahout/mahout-math-0.13.0.jar</td> + <td></td> + </tr> + <tr> + <td>/path/to/mahout/mahout-math-scala_2.11-0.13.0.jar</td> + <td></td> + </tr> + <tr> + <td>/path/to/mahout/mahout-spark_2.11-0.13.0.jar</td> + <td></td> + </tr> + <tr> + <td>/path/to/mahout/mahout-spark_2.11-0.13.0-dependency-reduced.jar</td> + <td></td> + </tr> +</table> +</div> + +#### If you chose Option2 in Step 1: + +<div class="table-striped"> +<table class="table"> + <tr> + <th>artifact</th> + <th>exclude</th> + </tr> + <tr> + <td>org.apache.mahout:mahout-math:0.13.0</td> + <td></td> + </tr> + <tr> + <td>org.apache.mahout:mahout-math-scala_2.10:0.13.0</td> + <td></td> + </tr> + <tr> + <td>org.apache.mahout:mahout-spark_2.10:0.13.0</td> + <td></td> + </tr> + <tr> + <td>org.apache.mahout:mahout-native-viennacl-omp_2.10:0.13.0</td> + <td></td> + </tr> + +</table> +</div> + + +_**OPTIONALLY**_ You can add **one** of the following artifacts for CPU/GPU acceleration. 
+ +<div class="table-striped"> +<table class="table"> + <tr> + <th>artifact</th> + <th>exclude</th> + <th>type of native solver</th> + </tr> + <tr> + <td>org.apache.mahout:mahout-native-viennacl_2.10:0.13.0</td> + <td></td> + <td>ViennaCL GPU Accelerated</td> + </tr> + <tr> + <td>org.apache.mahout:mahout-native-viennacl-omp_2.10:0.13.0</td> + <td></td> + <td>ViennaCL-OMP CPU Accelerated (use this if you don't have a good graphics card)</td> + </tr> +</table> +</div> + +Make sure to click "Save" and you're all set. + +### Step 4. Rock and Roll. + +Mahout in Zeppelin, unlike the Mahout Shell, won't take care of importing the Mahout libraries or creating the +`MahoutSparkContext`; we need to do that manually. This is easy though. Whenever you start Zeppelin (or restart the +Mahout interpreter), you'll need to run the following code first: + + %sparkMahout + + import org.apache.mahout.math._ + import org.apache.mahout.math.scalabindings._ + import org.apache.mahout.math.drm._ + import org.apache.mahout.math.scalabindings.RLikeOps._ + import org.apache.mahout.math.drm.RLikeDrmOps._ + import org.apache.mahout.sparkbindings._ + + implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext = sc2sdc(sc) + + +At this point, you have a Zeppelin Interpreter which will behave like the `$MAHOUT_HOME/bin/mahout spark-shell` + +Except, much much more. + +At the beginning I mentioned a few important features of Zeppelin that we could leverage to use Zeppelin for visualizations. + +#### Example 1: Visualizing a Matrix (Sample) + +In Mahout we can use `Matrices.symmetricUniformView` to create a Gaussian Matrix. + +We can use `.mapBlock` and some clever code to create a 3D Gaussian Matrix. + +We can use `.drmSampleToTSV` to take a sample of the matrix and turn it into a tab-separated string. 
We take a sample of + the matrix because, since we are dealing with "big" data, we wouldn't want to try to collect and plot the entire matrix. + However, if we knew we had a small matrix and we did want to sample the entire thing, then we could sample `100.0`, i.e. 100%. + +Finally we use `z.put(...)` to put a variable into Zeppelin's `ResourcePool`, a block of memory shared by all interpreters. + + + %sparkMahout + + val mxRnd3d = Matrices.symmetricUniformView(5000, 3, 1234) + val drmRand3d = drmParallelize(mxRnd3d) + + val drmGauss = drmRand3d.mapBlock() {case (keys, block) => + val blockB = block.like() + for (i <- 0 until block.nrow) { + val x: Double = block(i, 0) + val y: Double = block(i, 1) + val z: Double = block(i, 2) + + blockB(i, 0) = x + blockB(i, 1) = y + blockB(i, 2) = Math.exp(-((Math.pow(x, 2)) + (Math.pow(y, 2)))/2) + } + keys -> blockB + } + + resourcePool.put("gaussDrm", drm.drmSampleToTSV(drmGauss, 50.0)) + +Here we sample 50% of the matrix and put it in the `ResourcePool` under a variable named "gaussDrm". + +Now, for the exciting part. Scala doesn't have a lot of great graphing utilities. But you know who does? R and Python. So +instead of trying to awkwardly visualize our data using Scala, let's just use R and Python. + +We start the Spark R interpreter (we do this because the regular R interpreter doesn't have access to the resource pools). + +We `z.get` the variable we just put in. + +We use R's `read.table` to read the string- this is very similar to how we would read a tsv file in R. + +Then we plot the data using the R `scatterplot3d` package. + +**Note** you may need to install `scatterplot3d`. 
In Ubuntu, do this with `sudo apt-get install r-cran-scatterplot3d` + + + %spark.r {"imageWidth": "400px"} + + library(scatterplot3d) + + + gaussStr = z.get("gaussDrm") + data <- read.table(text= gaussStr, sep="\t", header=FALSE) + + scatterplot3d(data, color="green") + + \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/tutorials/mahout-in-zeppelin/zeppelin1.png ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/mahout-in-zeppelin/zeppelin1.png b/website/docs/tutorials/mahout-in-zeppelin/zeppelin1.png new file mode 100644 index 0000000..54fcbc2 Binary files /dev/null and b/website/docs/tutorials/mahout-in-zeppelin/zeppelin1.png differ http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/tutorials/mahout-in-zeppelin/zeppelin2.png ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/mahout-in-zeppelin/zeppelin2.png b/website/docs/tutorials/mahout-in-zeppelin/zeppelin2.png new file mode 100644 index 0000000..724cf7a Binary files /dev/null and b/website/docs/tutorials/mahout-in-zeppelin/zeppelin2.png differ http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/docs/tutorials/mahout-in-zeppelin/zeppelin3.png ---------------------------------------------------------------------- diff --git a/website/docs/tutorials/mahout-in-zeppelin/zeppelin3.png b/website/docs/tutorials/mahout-in-zeppelin/zeppelin3.png new file mode 100644 index 0000000..2136c5b Binary files /dev/null and b/website/docs/tutorials/mahout-in-zeppelin/zeppelin3.png differ http://git-wip-us.apache.org/repos/asf/mahout/blob/0b38f516/website/front/community/blogs.md ---------------------------------------------------------------------- diff --git a/website/front/community/blogs.md b/website/front/community/blogs.md index 7ddd0ac..b169455 100644 --- a/website/front/community/blogs.md +++ 
b/website/front/community/blogs.md @@ -11,6 +11,11 @@ theme: Description --> +### [Precanned Algorithms in Apache Mahout](https://rawkintrevo.org/2017/05/02/introducing-pre-canned-algorithms-apache-mahout/) +**Trevor Grant** | _05/02/2017_ | rawkintrevo.org + +An introduction to the new Algorithms Framework + ### [Getting Started With Apache Mahout](https://datascience.ibm.com/blog/getting-started-with-apache-mahout-2/) **Trevor Grant** | _04/25/2017_ | datascience.ibm.com/blog
