http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/contributing-algos/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/misc/contributing-algos/index.md 
b/website-old/docs/tutorials/misc/contributing-algos/index.md
deleted file mode 100644
index 618dd4b..0000000
--- a/website-old/docs/tutorials/misc/contributing-algos/index.md
+++ /dev/null
@@ -1,416 +0,0 @@
----
-layout: default
-title: Contributing new algorithms
-theme: 
-    name: mahout2
----
-
-The Mahout community is driven by user contributions.  If you have implemented an algorithm and are interested in
-sharing it with the rest of the community, we highly encourage you to contribute it to the codebase.  However, to
-keep the codebase consistent, we have instituted a standard API for our algorithms, and we ask that you
-contribute your algorithm in a way that conforms to this API as much as possible.  You can always reach out on
-[d...@mahout.apache.org](d...@mahout.apache.org) if you need help.
-
-In this example, let's say you've created a totally new algorithm: a regression algorithm called `Foo`.
-
-The `Foo` algorithm is a silly algorithm: it always guesses that the target is 1.  Not at all useful as an
-algorithm, but great for illustrating how an algorithm would be added.
-
-## Step 1: Create a JIRA ticket
-
-Go to [the Mahout JIRA board](https://issues.apache.org/jira/browse/MAHOUT/) 
and click the red "Create" button. 
-
-If you don't see the button, you may need to be granted contributor rights on 
the JIRA board.  Send a message to 
[d...@mahout.apache.org](d...@mahout.apache.org) and someone will add you.
- 
-Once you click "Create", a dialog box will pop up.
- 
-In the Summary box, write a short description, something like `Implement Foo Algorithm`.
-
-Under Assignee, click "Assign to Me"; this lets everyone know you're working on the algorithm.
- 
-In the description, it would be good to describe the algorithm and, if possible, link to a Wikipedia article about the algorithm/method, or better yet an academic journal article.
-
-![Jira Dialogue box](new-jira.png)
-
-Once you've created the issue, a JIRA number will be assigned.  For example, here is the JIRA ticket for me writing this tutorial.
-
-![New Issue Created](jira.png)
-
-This is MAHOUT-1980.  Whatever number your new issue is assigned, we'll refer to it as XXXX for the rest of the tutorial.
-
-## Step 2. Clone Mahout and create branch
-
-Supposing you don't already have a copy of Mahout on your computer, open a terminal and type the following:
-   
-    git clone http://github.com/apache/mahout
-    
-This will clone the Mahout source code into a directory called `mahout`.  (If you plan to open a pull request later, fork Mahout on
-GitHub first and clone your fork instead, so you have somewhere to push your branch.)  Go into that directory and create a new branch
-called `mahout-xxxx` (where `xxxx` is your JIRA number from step 1):
-    
-    cd mahout
-    git checkout -b mahout-xxxx
-
-## Step 3. Create Classes for your new Algorithm
-
-**NOTE** I am using IntelliJ Community Edition as an IDE.  Several good IDEs exist, and I _highly_ recommend
-you use one, but you can do what you like.  As far as screenshots go though, that is what I'm working with here.
-
-Create a file 
`mahout/math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/FooModel.scala`
-
-The first thing to add to the file is a license:
-    
-    /**
-      * Licensed to the Apache Software Foundation (ASF) under one
-      * or more contributor license agreements. See the NOTICE file
-      * distributed with this work for additional information
-      * regarding copyright ownership. The ASF licenses this file
-      * to you under the Apache License, Version 2.0 (the
-      * "License"); you may not use this file except in compliance
-      * with the License. You may obtain a copy of the License at
-      *
-      * http://www.apache.org/licenses/LICENSE-2.0
-      *
-      * Unless required by applicable law or agreed to in writing,
-      * software distributed under the License is distributed on an
-      * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-      * KIND, either express or implied. See the License for the
-      * specific language governing permissions and limitations
-      * under the License.
-      */
-      
-The next thing to add to the file is the `package` statement. 
-
-    package org.apache.mahout.math.algorithms.regression
-    
-And finally, declare the Fitter and Model classes:
-
-    class Foo[K] extends RegressorFitter[K] {
-    
-    }
-    
-    class FooModel[K] extends RegressorModel[K] {
-    
-    }
-    
-
-The Fitter class holds the methods for fitting, which return a Model; the Model class holds the parameters for the model and
-the methods for using it on new data sets. In a RegressorModel that is going to be a `predict()` method.
-  
-In your algorithm, most of your code is going to go into the `.fit` method. Since this is just a silly example, we
-don't really have anything to fit, so we're just going to return a FooModel (because that is what the Fitter must do).
- 
-    class Foo[K] extends RegressorFitter[K] {
-    
-      def fit(drmX: DrmLike[K],
-              drmTarget: DrmLike[K],
-              hyperparameters: (Symbol, Any)*): FooModel[K] = {
-        /**
-          * Normally one would have a lot more code here.
-          */
-
-        var model = new FooModel[K]
-        model.summary = "This model has been fit, I would tell you more 
interesting things- if there was anything to tell."
-        model
-      }
-    }
-    
-    class FooModel[K] extends RegressorModel[K] {
-    
-      def predict(drmPredictors: DrmLike[K]): DrmLike[K] = {
-        drmPredictors.mapBlock(1) {
-          case (keys, block: Matrix) => {
-            var outputBlock = new DenseMatrix(block.nrow, 1)
-            keys -> (outputBlock += 1.0)
-          }
-        }
-      }
-    }
-
-I've also added something to the summary string. It wasn't a very helpful thing, but this isn't a very helpful algorithm. I included
-it as a reminder to you, the person writing a useful algorithm, that this is a good place to talk about the results of the fitting.
-
-At this point it would be reasonable to try building Mahout and checking that your algorithm is working the way you expect it to:
-
-    mvn clean package -DskipTests
-    
-I like to use the Mahout Spark-Shell for cases like this.
-
-    cd $MAHOUT_HOME/bin
-    ./mahout spark-shell
-    
-Then once I'm in the shell:
-
-    import org.apache.mahout.math.algorithms.regression.Foo
-    
-    val drmA = drmParallelize(dense((1.0, 1.2, 1.3, 1.4), (1.1, 1.5, 2.5, 
1.0), (6.0, 5.2, -5.2, 5.3), (7.0,6.0, 5.0, 5.0), (10.0, 1.0, 20.0, -10.0)))
-    
-    val model = new Foo().fit(drmA(::, 0 until 2), drmA(::, 2 until 3))
- 
-    model.predict(drmA).collect
-    
-And everything seems to be in order. 
-
-    res5: org.apache.mahout.math.Matrix =
-    {
-     0 =>   {0:1.0}
-     1 =>   {0:1.0}
-     2 =>   {0:1.0}
-     3 =>   {0:1.0}
-     4 =>   {0:1.0}
-    }
-    
-## Step 4. Working with Hyperparameters
-
-It's entirely likely you'll need hyperparameters to tune your algorithm.
-
-In Mahout we handle these with a map of Symbols. You might have noticed that in the `fit` method we included
-`hyperparameters: (Symbol, Any)*`.
-
-Well let's look at how we would work with those.
-
-Suppose instead of always guessing "1.0" we wanted to guess some user-defined number (a very silly algorithm).
-
-We'll be adding a parameter called `guessThisNumber` to the Fitter. By convention, we usually create a function in
-the fitter called `setStandardHyperparameters`, let that take care of setting up all of our hyperparameters, and then call
-that function inside of `fit`. This keeps things nice and clean.
-
-    class Foo[K] extends RegressorFitter[K] {
-    
-      var guessThisNumber: Double = _
-    
-      def setStandardHyperparameters(hyperparameters: Map[Symbol, Any] = 
Map('foo -> None)): Unit = {
-        guessThisNumber = hyperparameters.asInstanceOf[Map[Symbol, 
Double]].getOrElse('guessThisNumber, 1.0)
-      }
-      def fit(drmX: DrmLike[K],
-              drmTarget: DrmLike[K],
-              hyperparameters: (Symbol, Any)*): FooModel[K] = {
-        /**
-          * Normally one would have a lot more code here.
-          */
-    
-        var model = new FooModel[K]
-    
-        setStandardHyperparameters(hyperparameters.toMap)
-        model.guessThisNumber = guessThisNumber
-        model.summary = s"This model will always guess 
${model.guessThisNumber}"
-        model
-      }
-    }
-    
-Also notice we set the _default value_ to 1.0.  We also now have something to 
add into the summary string.
-
-To implement this, we'll need to broadcast the guessed number in the `predict` method.  In Mahout you can only broadcast
-`Vectors` and `Matrices`, and we use `drmBroadcast` to do it.  It might be tempting to use the broadcast method of the underlying engine, but
-this is a no-no.  The reason for this is that we want to keep our algorithm abstracted over multiple distributed engines.
-
-    class FooModel[K] extends RegressorModel[K] {
-    
-      var guessThisNumber: Double = _
-    
-      def predict(drmPredictors: DrmLike[K]): DrmLike[K] = {
-    
-        // This is needed for MapBlock
-        implicit val ktag =  drmPredictors.keyClassTag
-    
-        // This is needed for broadcasting
-        implicit val ctx = drmPredictors.context
-        
-        val bcGuess = drmBroadcast(dvec(guessThisNumber))
-        drmPredictors.mapBlock(1) {
-          case (keys, block: Matrix) => {
-            var outputBlock = new DenseMatrix(block.nrow, 1)
-            keys -> (outputBlock += bcGuess.get(0))
-          }
-        }
-      }
-    }
-    
-We can get pretty creative with what sort of information we can send out in the broadcast, even when it's just Vectors and Matrices.
-
-Here we get the single number we need by broadcasting it in a length-1 vector.  We then `get` our number from that position.
-
-    keys -> (outputBlock += bcGuess.get(0))
-    
-
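-For example, a model with more than one scalar parameter could pack them all into a single vector before broadcasting and unpack them by position inside `mapBlock`.  The following is a hypothetical sketch only (the `interceptGuess` and `slopeGuess` parameters are not part of `Foo`); it assumes it runs inside a `predict` method like the one above, with `drmPredictors` and the implicit context and key class tag already in scope:
-
-    // Hypothetical sketch: pack two scalar parameters into one broadcast vector
-    val interceptGuess = 0.5   // illustrative values only
-    val slopeGuess = 2.0
-    val bcParams = drmBroadcast(dvec(interceptGuess, slopeGuess))
-
-    drmPredictors.mapBlock(1) {
-      case (keys, block: Matrix) => {
-        val outputBlock = new DenseMatrix(block.nrow, 1)
-        for (i <- 0 until block.nrow) {
-          // unpack by position: 0 holds the intercept, 1 holds the slope
-          outputBlock(i, 0) = bcParams.get(0) + bcParams.get(1) * block(i, 0)
-        }
-        keys -> outputBlock
-      }
-    }
-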
-Let's open up the `$MAHOUT_HOME/bin/mahout spark-shell` and try out the hyperparameter:
-
-    import org.apache.mahout.math.algorithms.regression.Foo
-        
-    val drmA = drmParallelize(dense((1.0, 1.2, 1.3, 1.4), (1.1, 1.5, 2.5, 
1.0), (6.0, 5.2, -5.2, 5.3), (7.0,6.0, 5.0, 5.0), (10.0, 1.0, 20.0, -10.0)))
-    
-    val model = new Foo().fit(drmA(::, 0 until 2), drmA(::, 2 until 3), 
'guessThisNumber -> 2.0)
- 
-    model.predict(drmA).collect
-
-## Step 5. Unit Tests
-
-Ahh, unit tests, so tedious but so important.  We want to create a unit test that will break if someone changes some other code
-in a way that causes our algorithm to fail (or guess the wrong numbers).
-
-By convention, if an algorithm has been written in R, it is best to make a very small sample data set, run the R version
-of the algorithm, and then verify your algorithm generates the same results, or explain why there is divergence.
-
-Since no one has implemented Foo in any other packages, however, we will have to be very sure our algorithm is correct by doing
-it by hand in R, and in a way others can verify.  Actually, I'm not making a prototype of the Foo algorithm, but I do want to stress
-that it _is_ very important that you do, so that the person reviewing the PR can verify your results. (Maybe there is an example
-in the paper you read.)
-
-Since this is a regression model, open up 
-
-    
$MAHOUT_HOME/math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuiteBase.scala
-    
-You'll see some other tests in there to get you started. You'll also see some R prototypes, and, in the case of `cochrane orcutt`,
-where the R implementation had divergent results, my argument for why our algorithm was right.
-
-I'm going to create a new test called `foo` and build it similarly to the example I've been using.
-
-    test("foo") {
-        import org.apache.mahout.math.algorithms.regression.Foo
-    
-        val drmA = drmParallelize(dense((1.0, 1.2, 1.3, 1.4),
-                                        (1.1, 1.5, 2.5, 1.0),
-                                        (6.0, 5.2, -5.2, 5.3),
-                                        (7.0,6.0, 5.0, 5.0),
-                                        (10.0, 1.0, 20.0, -10.0)))
-    
-        val model = new Foo().fit(drmA(::, 0 until 2), drmA(::, 2 until 3), 
'guessThisNumber -> 2.0)
-    
-        val myAnswer = model.predict(drmA).collect
-        val correctAnswer = dense( (2.0),
-                                    (2.0),
-                                    (2.0),
-                                    (2.0),
-                                    (2.0))
-    
-    
-        val epsilon = 1E-6
-        (myAnswer - correctAnswer).sum should be < epsilon
-      }
-      
-Note the use of `epsilon`.  The difference really _should be_ 0.0, but, especially for more complicated algorithms, we allow
-for a touch of machine rounding error.
-
-Now check your tests by building Mahout without skipping the tests:
-
-    mvn clean package
-    
-## Step 6. Add documentation to the website
-
-Now that you've created this awesome algorithm, it's time to do a little marketing!  Create the following file:
-
-    $MAHOUT_HOME/website/docs/algorithms/regression/foo.md
-
-In that file create a blank Jekyll template:
-    
-    ---
-    layout: algorithm
-    title: Foo
-    theme:
-        name: mahout2
-    ---
-    
-    ### About
-    
-    Foo is a very famous and useful algorithm. Let me tell you lots about it...
-    
-    [A mark down link](https://en.wikipedia.org/wiki/Foobar)
-    
-    ### Parameters
-    
-    <div class="table-striped">
-      <table class="table">
-        <tr>
-            <th>Parameter</th>
-            <th>Description</th>
-            <th>Default Value</th>
-        </tr>
-        <tr>
-            <td><code>'guessThisNumber</code></td>
-            <td>The number to guess</td>
-            <td><code>1.0</code></td>
-        </tr>
-        </table>
-        </div>
-    
-    ### Example
-    import org.apache.mahout.math.algorithms.regression.Foo
-            
-    val drmA = drmParallelize(dense((1.0, 1.2, 1.3, 1.4), (1.1, 1.5, 2.5, 
1.0), (6.0, 5.2, -5.2, 5.3), (7.0,6.0, 5.0, 5.0), (10.0, 1.0, 20.0, -10.0)))
-    
-    val model = new Foo().fit(drmA(::, 0 until 2), drmA(::, 2 until 3), 
'guessThisNumber -> 2.0)
- 
-    model.predict(drmA).collect
-
-The first few lines between the `---` markers are the header; this has the title and tells Jekyll what sort of page this is,
-which is how it knows how to compile the page (add navbars, etc.).
-
-The _About_ section is your chance to really dive into the algorithm and its implementation. More is more. If you didn't have an
-R prototype in the unit tests (or have a divergent answer from R), this is a good place to really expand on that.
- 
-The _Parameters_ section is a good reference for users to know what the hyperparameters are and what they do.
-
-The _Example_ section is a quick little example to get users started. You may have noticed I used the same code I used in
-the unit test.  That's something I do often, but there is no reason you have to. If you want to come up with a more illustrative example
-(or several illustrative examples), that's encouraged.
-
-Next, add links to the nav-bars:
-
-    $MAHOUT_HOME/website/docs/_includes/algo_navbar.html
-    $MAHOUT_HOME/website/docs/_includes/navbar.html
-
-You can look at the links already in there, but it's going to look something 
like
-
-    <li> <a href="{{ BASE_PATH }}/algorithms/regression/foo.html">Foo</a></li>
-
-Jekyll will convert your `*.md` file into `*.html` at the same place in the directory tree.
-
-To check that your webpage looks right:
-
-    cd $MAHOUT_HOME/website/docs
-    jekyll serve
-
-Then open a web browser and go to [http://localhost:4000](http://localhost:4000).
-
-If you're feeling really froggy, you're also welcome to add a tutorial :)
-
-## Step 7. Commit Changes, Push to GitHub, and Open a PR
-
-Open a terminal, return to the top-level `mahout` directory, and type:
-    
-    git status
-    
-You'll see `Changes not staged for commit`.  
-
-Any file you touched will be listed there, but we only want to stage the files we actually changed.
-
-For this tutorial it was
-
-    git add math-scala/src/main/scala/org/apache/mahout/math/algorithms/regression/FooModel.scala
-    git add 
math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionSuiteBase.scala
-    git add website/docs/algorithms/regression/foo.md
-    git add website/docs/_includes/algo_navbar.html
-    git add website/docs/_includes/navbar.html
-    
-Now we _commit_ our changes. We add a message that starts with `MAHOUT-xxxx` 
where `xxxx` is the JIRA number your issue was
-assigned, then a descriptive title.
-
-    git commit -m "MAHOUT-xxxx Implement Foo Algorithm"
-    
-Finally, _push_ the changes to your fork on GitHub. The `-u` flag creates the new remote branch and sets your local branch to track it:
-
-    git push -u origin mahout-xxxx
-
-Now in your browser, go to 
[http://github.com/yourusername/mahout](http://github.com/yourusername/mahout)
-
-There is a "Branch" drop-down menu; scroll down and find `mahout-xxxx`.
-
-![i have lots of issues](github-branch.png)
-
-Towards the top, off to the right, you'll see a link to "Pull Request"; click on this and follow the prompts!
-
-![create PR](create-pr.png)
-
-## Conclusion
-
-That's it!  Thanks so much for contributing to Apache Mahout; users like you are what keep this project moving forward!
-
-I've included [Foo.scala](Foo.scala) and 
[RegressionSuiteBase.scala](RegressionSuiteBase.scala) for reference. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/contributing-algos/jira.png
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/misc/contributing-algos/jira.png 
b/website-old/docs/tutorials/misc/contributing-algos/jira.png
deleted file mode 100644
index d4c5bcd..0000000
Binary files a/website-old/docs/tutorials/misc/contributing-algos/jira.png and 
/dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/contributing-algos/new-jira.png
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/misc/contributing-algos/new-jira.png 
b/website-old/docs/tutorials/misc/contributing-algos/new-jira.png
deleted file mode 100644
index efaba7e..0000000
Binary files a/website-old/docs/tutorials/misc/contributing-algos/new-jira.png 
and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/how-to-build-an-app.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/misc/how-to-build-an-app.md 
b/website-old/docs/tutorials/misc/how-to-build-an-app.md
deleted file mode 100644
index a17c189..0000000
--- a/website-old/docs/tutorials/misc/how-to-build-an-app.md
+++ /dev/null
@@ -1,256 +0,0 @@
----
-layout: tutorial
-title: Mahout Samsara In Core
-theme:
-    name: mahout2
----
-# How to create an App using Mahout
-
-This is an example of how to create a simple app using Mahout as a library. The source is available on GitHub in the
-[3-input-cooc project](https://github.com/pferrel/3-input-cooc) with more explanation about what it does (it has to do with
-collaborative filtering). For this tutorial we'll concentrate on the app rather than the data science.
-
-The app reads in three types of user-item interactions and creates indicators for them using cooccurrence and cross-cooccurrence.
-The indicators will be written to text files in a format ready for indexing by a search-engine-based recommender.
-
-## Setup
-In order to build and run the CooccurrenceDriver you need to install the 
following:
-
-* Install the Java 7 JDK from Oracle. Mac users look here: [Java SE 
Development Kit 
7u72](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html).
-* Install sbt (simple build tool) 0.13.x for [Mac](http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Mac.html), [Linux](http://www.scala-sbt.org/release/tutorial/Installing-sbt-on-Linux.html) or [manual installation](http://www.scala-sbt.org/release/tutorial/Manual-Installation.html).
-* Install [Spark 1.1.1](https://spark.apache.org/docs/1.1.1/spark-standalone.html). Don't forget to set up SPARK_HOME.
-* Install [Mahout 0.10.0](http://mahout.apache.org/general/downloads.html). Don't forget to set up MAHOUT_HOME and MAHOUT_LOCAL.
-
-Why install if you are only using them as a library? Certain binaries and scripts are required by the libraries to get information
-about the environment, such as discovering where jars are located.
-
-Spark requires a set of jars on the classpath for the client-side part of an app, and another set of jars must be passed to the
-Spark Context for running distributed code. The example should discover all the necessary classes automatically.
-
-## Application
-Using Mahout as a library in an application will require a little Scala code. Scala has an App trait, so we'll create an object
-that inherits from ```App```:
-
-
-    object CooccurrenceDriver extends App {
-    }
-    
-
-This will look a little different from Java: ```App``` does delayed initialization, which causes the body to be executed when the
-App is launched, whereas in Java you would create a main method.
-
-Before we can execute something on Spark we'll need to create a context. We could use raw Spark calls here, but default values are
-set up for a Mahout context by using the Mahout helper function:
-
-    implicit val mc = mahoutSparkContext(masterUrl = "local", 
-      appName = "CooccurrenceDriver")
-    
-We need to read in three files containing different interaction types. The 
files will each be read into a Mahout IndexedDataset. This allows us to 
preserve application-specific user and item IDs throughout the calculations.
-
-For example, here is data/purchase.csv:
-
-    u1,iphone
-    u1,ipad
-    u2,nexus
-    u2,galaxy
-    u3,surface
-    u4,iphone
-    u4,galaxy
-
-Mahout has a helper function, SparkEngine.indexedDatasetDFSReadElements, that reads the text-delimited files. The function reads
-single-element tuples (user-id,item-id) in a distributed way to create the IndexedDataset. Distributed Row Matrices (DRM) and
-Vectors are important data types supplied by Mahout, and IndexedDataset is like a very lightweight DataFrame in R; it wraps a
-DRM with HashBiMaps for row and column IDs.
-
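-As a minimal, illustrative sketch of that helper on its own (the file name is just the sample shown above, and the implicit Mahout context ```mc``` created earlier is assumed to be in scope), reading a single file looks roughly like this:
-
-    // Sketch: read one element-tuple file into an IndexedDataset
-    val purchases: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
-      "data/purchase.csv",
-      schema = DefaultIndexedDatasetElementReadSchema)
-    // purchases.rowIDs maps user IDs to row indices; purchases.columnIDs maps item IDs to columns
-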
-One important thing to note about this example is that we read in all datasets before we adjust the number of rows in them to match
-the total number of users in the data. This is so the math works out
-[(A'A, A'B, A'C)](http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html): even if some users took one action but
-not another, there must be the same number of rows in all matrices.
-
-    /**
-     * Read files of element tuples and create IndexedDatasets one per action. 
These 
-     * share a userID BiMap but have their own itemID BiMaps
-     */
-    def readActions(actionInput: Array[(String, String)]): Array[(String, 
IndexedDataset)] = {
-      var actions = Array[(String, IndexedDataset)]()
-
-      val userDictionary: BiMap[String, Int] = HashBiMap.create()
-
-      // The first action named in the sequence is the "primary" action and 
-      // begins to fill up the user dictionary
-      for ( actionDescription <- actionInput ) {// grab the path to actions
-        val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
-          actionDescription._2,
-          schema = DefaultIndexedDatasetElementReadSchema,
-          existingRowIDs = userDictionary)
-        userDictionary.putAll(action.rowIDs)
-        // put the name in the tuple with the indexedDataset
-        actions = actions :+ (actionDescription._1, action) 
-      }
-
-      // After all actions are read in, the userDictionary will contain every user seen,
-      // even if they may not have taken all actions. Now we adjust the row rank of
-      // all IndexedDatasets to have this number of rows
-      // Note: this is very important or the cooccurrence calc may fail
-      val numUsers = userDictionary.size() // one more than the cardinality
-
-      val resizedNameActionPairs = actions.map { a =>
-        // resize the matrix, in effect, by adding empty rows
-        val resizedMatrix = a._2.create(a._2.matrix, userDictionary, 
a._2.columnIDs).newRowCardinality(numUsers)
-        (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
-      }
-      resizedNameActionPairs // return the array of Tuples
-    }
-
-
-Now that we have the data read in, we can perform the cooccurrence calculation.
-
-    // actions.map creates an array of just the IndexedDatasets
-    val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(
-      actions.map(a => a._2)) 
-
-All we need to do now is write the indicators.
-
-    // zip a pair of arrays into an array of pairs, reattaching the action 
names
-    val indicatorDescriptions = actions.map(a => a._1).zip(indicatorMatrices)
-    writeIndicators(indicatorDescriptions)
-
-
-The ```writeIndicators``` method uses the default write function 
```dfsWrite```.
-
-    /**
-     * Write indicatorMatrices to the output dir in the default format
-     * for indexing by a search engine.
-     */
-    def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
-      for (indicator <- indicators ) {
-        // create a name based on the type of indicator
-        val indicatorDir = OutputPath + indicator._1
-        indicator._2.dfsWrite(
-          indicatorDir,
-          // Schema tells the writer to omit LLR strengths 
-          // and format for search engine indexing
-          IndexedDatasetWriteBooleanSchema) 
-      }
-    }
- 
-
-See the GitHub project for the full source. Now we create a build.sbt to build the example.
-
-    name := "cooccurrence-driver"
-
-    organization := "com.finderbots"
-
-    version := "0.1"
-
-    scalaVersion := "2.10.4"
-
-    val sparkVersion = "1.1.1"
-
-    libraryDependencies ++= Seq(
-      "log4j" % "log4j" % "1.2.17",
-      // Mahout's Spark code
-      "commons-io" % "commons-io" % "2.4",
-      "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
-      "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
-      "org.apache.mahout" % "mahout-math" % "0.10.0",
-      "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
-      // Google collections, AKA Guava
-      "com.google.guava" % "guava" % "16.0")
-
-    resolvers += "typesafe repo" at "http://repo.typesafe.com/typesafe/releases/"
-
-    resolvers += Resolver.mavenLocal
-
-    packSettings
-
-    packMain := Map(
-      "cooc" -> "CooccurrenceDriver")
-
-
-## Build
-Build the example from the project's root folder:
-
-    $ sbt pack
-
-This will automatically set up some launcher scripts for the driver. To run it, execute:
-
-    $ target/pack/bin/cooc
-    
-The driver will execute in Spark standalone mode and put the data in 
/path/to/3-input-cooc/data/indicators/*indicator-type*
-
-## Using a Debugger
-To build and run this example in a debugger like IntelliJ IDEA, install it from the IntelliJ site and add the Scala plugin.
-
-Open IDEA and go to the menu File->New->Project from existing 
sources->SBT->/path/to/3-input-cooc. This will create an IDEA project from 
```build.sbt``` in the root directory.
-
-At this point you may create a "Debug Configuration" to run. In the menu choose Run->Edit Configurations. Under "Default" choose
-"Application". In the dialog, hit the ellipsis button "..." to the right of "Environment Variables" and fill in your versions of
-JAVA_HOME, SPARK_HOME, and MAHOUT_HOME. In the configuration editor, under "Use classpath from", choose the root-3-input-cooc
-module.
-
-![image](http://mahout.apache.org/images/debug-config.png)
-
-Now choose "Application" in the left pane and hit the plus sign "+". Give the config a name and hit the ellipsis button to the right
-of the "Main class" field as shown.
-
-![image](http://mahout.apache.org/images/debug-config-2.png)
-
-
-After setting breakpoints you are now ready to debug the configuration. Go to 
the Run->Debug... menu and pick your configuration. This will execute using a 
local standalone instance of Spark.
-
-## The Mahout Shell
-
-For small script-like apps you may wish to use the Mahout shell. It is a Scala REPL-style interactive shell built on the Spark shell
-with Mahout-Samsara extensions.
-
-To make CooccurrenceDriver.scala into a script, make the following changes:
-
-* You won't need the context, since it is created when the shell is launched; comment that line out.
-* Replace the logger.info lines with println.
-* Remove the package statement since it's not needed. Save the result as ```path/to/3-input-cooc/bin/CooccurrenceDriver.mscala```.
-
-Note the extension ```.mscala```, which indicates we are using Mahout's Scala extensions for math, otherwise known as
-[Mahout-Samsara](http://mahout.apache.org/users/environment/out-of-core-reference.html).
-
-To run the code, make sure the output does not already exist:
-
-    $ rm -r /path/to/3-input-cooc/data/indicators
-    
-Launch the Mahout + Spark shell:
-
-    $ mahout spark-shell
-    
-You'll see the Mahout splash:
-
-    MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
-
-                         _                 _
-             _ __ ___   __ _| |__   ___  _   _| |_
-            | '_ ` _ \ / _` | '_ \ / _ \| | | | __|
-            | | | | | | (_| | | | | (_) | |_| | |_
-            |_| |_| |_|\__,_|_| |_|\___/ \__,_|\__|  version 0.10.0
-
-      
-    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 
1.7.0_72)
-    Type in expressions to have them evaluated.
-    Type :help for more information.
-    15/04/26 09:30:48 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
-    Created spark context..
-    Mahout distributed context is available as "implicit val sdc".
-    mahout> 
-
-To load the driver type:
-
-    mahout> :load /path/to/3-input-cooc/bin/CooccurrenceDriver.mscala
-    Loading ./bin/CooccurrenceDriver.mscala...
-    import com.google.common.collect.{HashBiMap, BiMap}
-    import org.apache.log4j.Logger
-    import org.apache.mahout.math.cf.SimilarityAnalysis
-    import org.apache.mahout.math.indexeddataset._
-    import org.apache.mahout.sparkbindings._
-    import scala.collection.immutable.HashMap
-    defined module CooccurrenceDriver
-    mahout> 
-
-To run the driver type:
-
-    mahout> CooccurrenceDriver.main(args = Array(""))
-    
-You'll get some stats printed:
-
-    Total number of users for all actions = 5
-    purchase indicator matrix:
-      Number of rows for matrix = 4
-      Number of columns for matrix = 5
-      Number of rows after resize = 5
-    view indicator matrix:
-      Number of rows for matrix = 4
-      Number of columns for matrix = 5
-      Number of rows after resize = 5
-    category indicator matrix:
-      Number of rows for matrix = 5
-      Number of columns for matrix = 7
-      Number of rows after resize = 5
-    
-If you look in ```path/to/3-input-cooc/data/indicators``` you should find 
folders containing the indicator matrices.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/mahout-in-zeppelin/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/misc/mahout-in-zeppelin/index.md 
b/website-old/docs/tutorials/misc/mahout-in-zeppelin/index.md
deleted file mode 100644
index 93a15c0..0000000
--- a/website-old/docs/tutorials/misc/mahout-in-zeppelin/index.md
+++ /dev/null
@@ -1,276 +0,0 @@
----
-layout: tutorial
-title: Visualizing Mahout in Zeppelin
-theme: 
-    name: mahout2
----
-
-
-[Apache Zeppelin](http://zeppelin.apache.org) is an exciting notebooking tool, designed for working with Big Data
-applications.  It comes with great integration for graphing in R and Python, supports multiple languages in a single
-notebook (and facilitates sharing of variables between interpreters), and makes working with Spark and Flink in an interactive
-environment (either locally or in cluster mode) a breeze.  Of course, it does lots of other cool things too, but those are the
-features we're going to take advantage of.
-
-### Step 1: Download and Install Zeppelin
-
-Zeppelin binaries by default use Spark 2.1 / Scala 2.11; until Mahout puts out Spark 2.1 / Scala 2.11 binaries, you have
-two options.
-
-#### Option 1: Build Mahout for Spark 2.1/Scala 2.11
-
-**Build Mahout**
-
-Follow the standard procedures for building Mahout, except manually set the 
Spark and Scala versions - the easiest way being:
-    
-    git clone http://github.com/apache/mahout
-    cd mahout
-    mvn clean package -Dspark.version=2.1.0 -Dscala.version=2.11.8 
-Dscala.compat.version=2.11 -DskipTests
-    
-
-**Download Zeppelin**
-
-    cd /a/good/place/to/install/
-    wget 
http://apache.mirrors.tds.net/zeppelin/zeppelin-0.7.1/zeppelin-0.7.1-bin-all.tgz
-    tar -xzf zeppelin-0.7.1-bin-all.tgz
-    cd zeppelin*
-    bin/zeppelin-daemon.sh start
-
-And that's it. Open a web browser and surf to 
[http://localhost:8080](http://localhost:8080)
-
-Proceed to Step 2.
-
-#### Option 2: Build Zeppelin for Spark 1.6/Scala 2.10
-
-We'll use Mahout binaries from Maven, so all you need to do is clone and build Zeppelin:
-
-    git clone http://github.com/apache/zeppelin
-    cd zeppelin
-    mvn clean package -Pspark1.6 -Pscala2.10 -DskipTests
-
-After it builds successfully...
-
-    bin/zeppelin-daemon.sh start
-    
-And that's it. Open a web browser and surf to 
[http://localhost:8080](http://localhost:8080)
-
-### Step 2: Create the Mahout Spark Interpreter
-
-After opening your web browser and surfing to [http://localhost:8080](http://localhost:8080), click on the `Anonymous`
-button in the top right corner, which will open a drop-down. Then click `Interpreter`.
-
-![Screen Shot1](zeppelin1.png)
-
-At the top right, just below the blue nav bar, you will see two buttons, "Repository" and "+Create".  Click on "+Create".
-
-The following screen should appear.
-
-![Screen Shot2](zeppelin2.png)
-
-In the **Interpreter Name** enter `sparkMahout` (you can name it whatever you like, but this is the name the code blocks later in
-the tutorial assume).
-
-In the **Interpreter group** drop down, select `spark`. A bunch of other 
settings will now auto-populate.
-
-Scroll to the bottom of the **Properties** list. In the last row, you'll see 
two blank boxes. 
-
-Add the following properties by clicking the "+" button to the right.
-
-<div class="table-striped">
-<table class="table">
-    <tr>
-        <th>name</th>
-        <th>value</th>
-    </tr>
-    <tr>
-        <td>spark.kryo.referenceTracking</td>
-        <td>false</td>
-    </tr>
-    <tr>
-        <td>spark.kryo.registrator</td>
-        <td>org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator</td>
-    </tr>
-    <tr>
-        <td>spark.kryoserializer.buffer</td>
-        <td>32</td>
-    </tr>
-    <tr>
-        <td>spark.kryoserializer.buffer.max</td>
-        <td>600m</td>
-    </tr>    
-    <tr>
-        <td>spark.serializer</td>
-        <td>org.apache.spark.serializer.KryoSerializer</td>
-    </tr> 
-</table>
-</div>
-
-### Step 3: Add Dependencies
-You'll also need to add the following **Dependencies**.
-
-#### If you chose Option 1 in Step 1:
-
-Where `/path/to/mahout` is the path to the directory where you've built Mahout.
-
-<div class="table-striped">
-<table class="table">
-    <tr>
-        <th>artifact</th>
-        <th>exclude</th>
-    </tr>
-    <tr>
-        <td>/path/to/mahout/mahout-math-0.13.0.jar</td>
-        <td></td>
-    </tr>
-    <tr>
-        <td>/path/to/mahout/mahout-math-scala_2.11-0.13.0.jar</td>
-        <td></td>
-    </tr>
-    <tr>
-        <td>/path/to/mahout/mahout-spark_2.11-0.13.0.jar</td>
-        <td></td>
-    </tr>  
-    <tr>
-        <td>/path/to/mahout/mahout-spark_2.11-0.13.0-dependency-reduced.jar</td>
-        <td></td>
-    </tr>
-</table>
-</div>
-
-#### If you chose Option 2 in Step 1:
-
-<div class="table-striped">
-<table class="table">
-    <tr>
-        <th>artifact</th>
-        <th>exclude</th>
-    </tr>
-    <tr>
-        <td>org.apache.mahout:mahout-math:0.13.0</td>
-        <td></td>
-    </tr>
-    <tr>
-        <td>org.apache.mahout:mahout-math-scala_2.10:0.13.0</td>
-        <td></td>
-    </tr>
-    <tr>
-        <td>org.apache.mahout:mahout-spark_2.10:0.13.0</td>
-        <td></td>
-    </tr>
-     <tr>
-         <td>org.apache.mahout:mahout-native-viennacl-omp_2.10:0.13.0</td>
-         <td></td>
-     </tr> 
-
-</table>
-</div>
-
-
-_**OPTIONALLY**_ You can add **one** of the following artifacts for CPU/GPU 
acceleration.
-
-<div class="table-striped">
-<table class="table">
-    <tr>
-        <th>artifact</th>
-        <th>exclude</th>
-        <th>type of native solver</th>
-    </tr>
-     <tr>
-         <td>org.apache.mahout:mahout-native-viennacl_2.10:0.13.0</td>
-         <td></td>
-         <td>ViennaCL GPU Accelerated</td>
-     </tr> 
-     <tr>
-         <td>org.apache.mahout:mahout-native-viennacl-omp_2.10:0.13.0</td>
-         <td></td>
-         <td>ViennaCL-OMP CPU Accelerated (use this if you don't have a good 
graphics card)</td>
-     </tr> 
-</table>
-</div>
-
-Make sure to click "Save" and you're all set. 
-
-### Step 4. Rock and Roll.
-
-Mahout in Zeppelin, unlike the Mahout Shell, won't take care of importing the Mahout libraries or creating the
-`MahoutSparkContext`; we need to do that manually. This is easy though.  Whenever you start Zeppelin (or restart the
-Mahout interpreter), you'll need to run the following code first:
-
-    %sparkMahout
-    
-    import org.apache.mahout.math._
-    import org.apache.mahout.math.scalabindings._
-    import org.apache.mahout.math.drm._
-    import org.apache.mahout.math.scalabindings.RLikeOps._
-    import org.apache.mahout.math.drm.RLikeDrmOps._
-    import org.apache.mahout.sparkbindings._
-    
-    implicit val sdc: org.apache.mahout.sparkbindings.SparkDistributedContext 
= sc2sdc(sc)
-    
-
-At this point, you have a Zeppelin interpreter which will behave like the `$MAHOUT_HOME/bin/mahout spark-shell`.
-
-Except it can do much, much more.
-
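-As a quick, optional sanity check (a minimal sketch; any small matrix will do), you can run a paragraph like the following after the import/context paragraph above to confirm everything is wired up:
-
-    %sparkMahout
-    
-    // Tiny smoke test: multiply a small matrix by its transpose and collect the result
-    val drmA = drmParallelize(dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)))
-    val mxAtA = (drmA.t %*% drmA).collect
-    println(mxAtA)
-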
-At the beginning I mentioned a few important features of Zeppelin that we can leverage to use Zeppelin for visualizations.
-
-#### Example 1: Visualizing a Matrix (Sample) with R
-
-In Mahout we can use `Matrices.symmetricUniformView` to create a random matrix.
-
-We can use `.mapBlock` and some clever code to create a 3D Gaussian matrix.
-
-We can use `drmSampleToTSV` to take a sample of the matrix and turn it into a tab-separated string. We take a sample of
- the matrix because, since we are dealing with "big" data, we wouldn't want to try to collect and plot the entire matrix;
- however, IF we knew we had a small matrix and we DID want to sample the entire thing, then we could sample `100.0`, i.e. 100%.
- 
-Finally we use `resourcePool.put(...)` to put a variable into Zeppelin's `ResourcePool`, a block of memory shared by all interpreters.
-
-
-    %sparkMahout
-    
-    val mxRnd3d = Matrices.symmetricUniformView(5000, 3, 1234)
-    val drmRand3d = drmParallelize(mxRnd3d)
-    
-    val drmGauss = drmRand3d.mapBlock() {case (keys, block) =>
-      val blockB = block.like()
-      for (i <- 0 until block.nrow) {
-        val x: Double = block(i, 0)
-        val y: Double = block(i, 1)
-        val z: Double = block(i, 2)
-    
-        blockB(i, 0) = x
-        blockB(i, 1) = y
-        blockB(i, 2) = Math.exp(-((Math.pow(x, 2)) + (Math.pow(y, 2)))/2)
-      }
-      keys -> blockB
-    }
-    
-    resourcePool.put("gaussDrm", drm.drmSampleToTSV(drmGauss, 50.0))
-    
-Here we sample 50% of the matrix and put it in the `ResourcePool` under a 
variable named "gaussDrm".
-
-Now, for the exciting part. Scala doesn't have a lot of great graphing utilities. But you know who does? R and Python. So
-instead of trying to awkwardly visualize our data using Scala, let's just use R and Python.
-
-We start the Spark R interpreter (we do this because the regular R interpreter 
doesn't have access to the resource pools).
-
-We `z.get` the variable we just put in. 
-
-We use R's `read.table` to read the string; this is very similar to how we would read a TSV file in R.
-
-Then we plot the data using the R `scatterplot3d` package. 
-
-**Note** you may need to install `scatterplot3d`. In Ubuntu, do this with 
`sudo apt-get install r-cran-scatterplot3d`
-
-
-    %spark.r {"imageWidth": "400px"}
-    
-    library(scatterplot3d)
-    
-    
-    gaussStr = z.get("gaussDrm")
-    data <- read.table(text= gaussStr, sep="\t", header=FALSE)
-    
-    scatterplot3d(data, color="green")
-
-![A neat plot](zeppelin3.png)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin1.png
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin1.png 
b/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin1.png
deleted file mode 100644
index 54fcbc2..0000000
Binary files a/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin1.png 
and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin2.png
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin2.png 
b/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin2.png
deleted file mode 100644
index 724cf7a..0000000
Binary files a/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin2.png 
and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin3.png
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin3.png 
b/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin3.png
deleted file mode 100644
index 2136c5b..0000000
Binary files a/website-old/docs/tutorials/misc/mahout-in-zeppelin/zeppelin3.png 
and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/samsara/classify-a-doc-from-the-shell.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/samsara/classify-a-doc-from-the-shell.md 
b/website-old/docs/tutorials/samsara/classify-a-doc-from-the-shell.md
deleted file mode 100644
index 8a49903..0000000
--- a/website-old/docs/tutorials/samsara/classify-a-doc-from-the-shell.md
+++ /dev/null
@@ -1,258 +0,0 @@
----
-layout: tutorial
-title: Text Classification Example
-theme:
-    name: mahout2
----
-
-# Building a text classifier in Mahout's Spark Shell
-
-This tutorial will take you through the steps used to train a Multinomial 
Naive Bayes model and create a text classifier based on that model using the 
```mahout spark-shell```. 
-
-## Prerequisites
-This tutorial assumes that you have your Spark environment variables set for the ```mahout spark-shell```; see
-[Playing with Mahout's Shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html).  We also assume that Mahout is
-running in cluster mode (i.e. with the ```MAHOUT_LOCAL``` environment variable **unset**), as we'll be reading and writing to HDFS.
-
-## Downloading and Vectorizing the Wikipedia dataset
-*As of Mahout v. 0.10.0, we are still reliant on the MapReduce versions of ```mahout seqwiki``` and ```mahout seq2sparse``` to
-extract and vectorize our text.  A* [*Spark implementation of seq2sparse*](https://issues.apache.org/jira/browse/MAHOUT-1663) *is
-in the works for Mahout v. 0.11.* However, to download the Wikipedia dataset, extract the bodies of the documents, label each
-document and vectorize the text into TF-IDF vectors, we can simply run the
-[wikipedia-classifier.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) example.
-
-    Please select a number to choose the corresponding task to run
-    1. CBayes (may require increased heap space on yarn)
-    2. BinaryCBayes
-    3. clean -- cleans up the work area in /tmp/mahout-work-wiki
-    Enter your choice :
-
-Enter (2). This will download a large recent XML dump of the Wikipedia database into a ```/tmp/mahout-work-wiki``` directory,
-unzip it, and place it into HDFS.  It will run a
-[MapReduce job to parse the wikipedia set](http://mahout.apache.org/users/classification/wikipedia-classifier-example.html),
-extracting and labeling only pages with category tags for [United States] and [United Kingdom] (~11600 documents). It will then run
-```mahout seq2sparse``` to convert the documents into TF-IDF vectors.  The script will also build and test a
-[Naive Bayes model using MapReduce](http://mahout.apache.org/users/classification/bayesian.html).  When it is completed, you should
-see a confusion matrix on your screen.  For this tutorial, we will ignore the MapReduce model, and build a new model using Spark
-based on the vectorized text output by ```seq2sparse```.
-
-## Getting Started
-
-Launch the ```mahout spark-shell```.  There is an example script: 
```spark-document-classifier.mscala``` (.mscala denotes a Mahout-Scala script 
which can be run similarly to an R script).   We will be walking through this 
script for this tutorial but if you wanted to simply run the script, you could 
just issue the command: 
-
-    mahout> :load /path/to/mahout/examples/bin/spark-document-classifier.mscala
-
-For now, let's take the script apart piece by piece.  You can cut and paste the following code blocks into the ```mahout spark-shell```.
-
-## Imports
-
-Our Mahout Naive Bayes imports:
-
-    import org.apache.mahout.classifier.naivebayes._
-    import org.apache.mahout.classifier.stats._
-    import org.apache.mahout.nlp.tfidf._
-
-Hadoop imports needed to read our dictionary:
-
-    import org.apache.hadoop.io.Text
-    import org.apache.hadoop.io.IntWritable
-    import org.apache.hadoop.io.LongWritable
-
-## Read in our full set from HDFS as vectorized by seq2sparse in 
classify-wikipedia.sh
-
-    val pathToData = "/tmp/mahout-work-wiki/"
-    val fullData = drmDfsRead(pathToData + "wikipediaVecs/tfidf-vectors")
-
-## Extract the category of each observation and aggregate those observations 
by category
-
-    val (labelIndex, aggregatedObservations) = 
SparkNaiveBayes.extractLabelsAndAggregateObservations(
-                                                                 fullData)
-
-## Build a Multinomial Naive Bayes model and self-test on the training set
-
-    val model = SparkNaiveBayes.train(aggregatedObservations, labelIndex, 
false)
-    val resAnalyzer = SparkNaiveBayes.test(model, fullData, false)
-    println(resAnalyzer)
-    
-Printing the ```ResultAnalyzer``` will display the confusion matrix.
-
-## Read in the dictionary and document frequency count from HDFS
-    
-    val dictionary = sdc.sequenceFile(pathToData + 
"wikipediaVecs/dictionary.file-0",
-                                      classOf[Text],
-                                      classOf[IntWritable])
-    val documentFrequencyCount = sdc.sequenceFile(pathToData + 
"wikipediaVecs/df-count",
-                                                  classOf[IntWritable],
-                                                  classOf[LongWritable])
-
-    // setup the dictionary and document frequency count as maps
-    val dictionaryRDD = dictionary.map { 
-                                    case (wKey, wVal) => 
wKey.asInstanceOf[Text]
-                                                             .toString() -> 
wVal.get() 
-                                       }
-                                       
-    val documentFrequencyCountRDD = documentFrequencyCount.map {
-                                            case (wKey, wVal) => 
wKey.asInstanceOf[IntWritable]
-                                                                     .get() -> 
wVal.get() 
-                                                               }
-    
-    val dictionaryMap = dictionaryRDD.collect.map(x => x._1.toString -> 
x._2.toInt).toMap
-    val dfCountMap = documentFrequencyCountRDD.collect.map(x => x._1.toInt -> 
x._2.toLong).toMap
-
-## Define a function to tokenize and vectorize new text using our current 
dictionary
-
-For this simple example, our function ```vectorizeDocument(...)``` will 
tokenize a new document into unigrams using native Java String methods and 
vectorize using our dictionary and document frequencies. You could also use a 
[Lucene](https://lucene.apache.org/core/) analyzer for bigrams, trigrams, etc., 
and integrate Apache [Tika](https://tika.apache.org/) to extract text from 
different document types (PDF, PPT, XLS, etc.).  Here, however, we will keep it simple, stripping and tokenizing our text using regexes and native String methods.
-
-    def vectorizeDocument(document: String,
-                            dictionaryMap: Map[String,Int],
-                            dfMap: Map[Int,Long]): Vector = {
-        val wordCounts = document.replaceAll("[^\\p{L}\\p{Nd}]+", " ")
-                                    .toLowerCase
-                                    .split(" ")
-                                    .groupBy(identity)
-                                    .mapValues(_.length)         
-        val vec = new RandomAccessSparseVector(dictionaryMap.size)
-        val totalDFSize = dfMap(-1)
-        val docSize = wordCounts.size
-        for (word <- wordCounts) {
-            val term = word._1
-            if (dictionaryMap.contains(term)) {
-                val tfidf: TermWeight = new TFIDF()
-                val termFreq = word._2
-                val dictIndex = dictionaryMap(term)
-                val docFreq = dfCountMap(dictIndex)
-                val currentTfIdf = tfidf.calculate(termFreq,
-                                                   docFreq.toInt,
-                                                   docSize,
-                                                   totalDFSize.toInt)
-                vec.setQuick(dictIndex, currentTfIdf)
-            }
-        }
-        vec
-    }
-
-## Setup our classifier
-
-    val labelMap = model.labelIndex
-    val numLabels = model.numLabels
-    val reverseLabelMap = labelMap.map(x => x._2 -> x._1)
-    
-    // instantiate the correct type of classifier
-    val classifier = model.isComplementary match {
-        case true => new ComplementaryNBClassifier(model)
-        case _ => new StandardNBClassifier(model)
-    }
-
-## Define an argmax function 
-
-The label with the highest score wins the classification for a given document.
-    
-    def argmax(v: Vector): (Int, Double) = {
-        var bestIdx: Int = Integer.MIN_VALUE
-        var bestScore: Double = Integer.MIN_VALUE.asInstanceOf[Int].toDouble
-        for(i <- 0 until v.size) {
-            if(v(i) > bestScore){
-                bestScore = v(i)
-                bestIdx = i
-            }
-        }
-        (bestIdx, bestScore)
-    }
-
-## Define our TF(-IDF) vector classifier
-
-    def classifyDocument(clvec: Vector) : String = {
-        val cvec = classifier.classifyFull(clvec)
-        val (bestIdx, bestScore) = argmax(cvec)
-        reverseLabelMap(bestIdx)
-    }
-
-## Two sample news articles: United States Football and United Kingdom Football
-    
-    // A random United States football article
-    // 
http://www.reuters.com/article/2015/01/28/us-nfl-superbowl-security-idUSKBN0L12JR20150128
-    val UStextToClassify = new String("(Reuters) - Super Bowl security 
officials acknowledge" +
-        " the NFL championship game represents a high profile target on a 
world stage but are" +
-        " unaware of any specific credible threats against Sunday's showcase. 
In advance of" +
-        " one of the world's biggest single day sporting events, Homeland 
Security Secretary" +
-        " Jeh Johnson was in Glendale on Wednesday to review security 
preparations and tour" +
-        " University of Phoenix Stadium where the Seattle Seahawks and New 
England Patriots" +
-        " will battle. Deadly shootings in Paris and arrest of suspects in 
Belgium, Greece and" +
-        " Germany heightened fears of more attacks around the world and social 
media accounts" +
-        " linked to Middle East militant groups have carried a number of 
threats to attack" +
-        " high-profile U.S. events. There is no specific credible threat, said 
Johnson, who" + 
-        " has appointed a federal coordination team to work with local, state 
and federal" +
-        " agencies to ensure safety of fans, players and other workers 
associated with the" + 
-        " Super Bowl. I'm confident we will have a safe and secure and 
successful event." +
-        " Sunday's game has been given a Special Event Assessment Rating 
(SEAR) 1 rating, the" +
-        " same as in previous years, except for the year after the Sept. 11, 
2001 attacks, when" +
-        " a higher level was declared. But security will be tight and visible 
around Super" +
-        " Bowl-related events as well as during the game itself. All fans will 
pass through" +
-        " metal detectors and pat downs. Over 4,000 private security personnel 
will be deployed" +
-        " and the almost 3,000 member Phoenix police force will be on Super 
Bowl duty. Nuclear" +
-        " device sniffing teams will be deployed and a network of Bio-Watch 
detectors will be" +
-        " set up to provide a warning in the event of a biological attack. The 
Department of" +
-        " Homeland Security (DHS) said in a press release it had held special 
cyber-security" +
-        " and anti-sniper training sessions. A U.S. official said the 
Transportation Security" +
-        " Administration, which is responsible for screening airline 
passengers, will add" +
-        " screeners and checkpoint lanes at airports. Federal air marshals, 
behavior detection" +
-        " officers and dog teams will help to secure transportation systems in 
the area. We" +
-        " will be ramping it (security) up on Sunday, there is no doubt about 
that, said Federal"+
-        " Coordinator Matthew Allen, the DHS point of contact for planning and 
support. I have" +
-        " every confidence the public safety agencies that represented in the 
planning process" +
-        " are going to have their best and brightest out there this weekend 
and we will have" +
-        " a very safe Super Bowl.")
-    
-    // A random United Kingdom football article
-    // 
http://www.reuters.com/article/2015/01/26/manchester-united-swissquote-idUSL6N0V52RZ20150126
-    val UKtextToClassify = new String("(Reuters) - Manchester United have 
signed a sponsorship" +
-        " deal with online financial trading company Swissquote, expanding the 
commercial" +
-        " partnerships that have helped to make the English club one of the 
richest teams in" +
-        " world soccer. United did not give a value for the deal, the club's 
first in the sector," +
-        " but said on Monday it was a multi-year agreement. The Premier League 
club, 20 times" +
-        " English champions, claim to have 659 million followers around the 
globe, making the" +
-        " United name attractive to major brands like Chevrolet cars and 
sportswear group Adidas." +
-        " Swissquote said the global deal would allow it to use United's 
popularity in Asia to" +
-        " help it meet its targets for expansion in China. Among benefits from 
the deal," +
-        " Swissquote's clients will have a chance to meet United players and 
get behind the scenes" +
-        " at the Old Trafford stadium. Swissquote is a Geneva-based online 
trading company that" +
-        " allows retail investors to buy and sell foreign exchange, equities, 
bonds and other asset" +
-        " classes. Like other retail FX brokers, Swissquote was left nursing 
losses on the Swiss" +
-        " franc after Switzerland's central bank stunned markets this month by 
abandoning its cap" +
-        " on the currency. The fallout from the abrupt move put rival and West 
Ham United shirt" +
-        " sponsor Alpari UK into administration. Swissquote itself was forced 
to book a 25 million" +
-        " Swiss francs ($28 million) provision for its clients who were left 
out of pocket" +
-        " following the franc's surge. United's ability to grow revenues off 
the pitch has made" +
-        " them the second richest club in the world behind Spain's Real 
Madrid, despite a" +
-        " downturn in their playing fortunes. United Managing Director Richard 
Arnold said" +
-        " there was still lots of scope for United to develop sponsorships in 
other areas of" +
-        " business. The last quoted statistics that we had showed that of the 
top 25 sponsorship" +
-        " categories, we were only active in 15 of those, Arnold told Reuters. 
I think there is a" +
-        " huge potential still for the club, and the other thing we have seen 
is there is very" +
-        " significant growth even within categories. United have endured a 
tricky transition" +
-        " following the retirement of manager Alex Ferguson in 2013, finishing 
seventh in the" +
-        " Premier League last season and missing out on a place in the 
lucrative Champions League." +
-        " ($1 = 0.8910 Swiss francs) (Writing by Neil Maidment, additional 
reporting by Jemima" + 
-        " Kelly; editing by Keith Weir)")
-
-## Vectorize and classify our documents
-
-    val usVec = vectorizeDocument(UStextToClassify, dictionaryMap, dfCountMap)
-    val ukVec = vectorizeDocument(UKtextToClassify, dictionaryMap, dfCountMap)
-    
-    println("Classifying the news article about superbowl security (united 
states)")
-    classifyDocument(usVec)
-    
-    println("Classifying the news article about Manchester United (united 
kingdom)")
-    classifyDocument(ukVec)
-
-## Tie everything together in a new method to classify text 
-    
-    def classifyText(txt: String): String = {
-        val v = vectorizeDocument(txt, dictionaryMap, dfCountMap)
-        classifyDocument(v)
-    }
-
-## Now we can simply call our classifyText(...) method on any String
-
-    classifyText("Hello world from Queens")
-    classifyText("Hello world from London")
-    
-## Model persistence
-
-You can save the model to HDFS:
-
-    model.dfsWrite("/path/to/model")
-    
-And retrieve it with:
-
-    val model =  NBModel.dfsRead("/path/to/model")
-
-The trained model can now be embedded in an external application.
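-
-As a minimal sketch (assuming the helpers defined above -- `vectorizeDocument`, `classifyDocument`, `dictionaryMap` and `dfCountMap` -- are available to the embedding application), this amounts to reading the model back and classifying incoming text with the same pipeline:
-
-    // minimal sketch: read the persisted model back from HDFS (hypothetical path)
-    val embeddedModel = NBModel.dfsRead("/path/to/model")
-    
-    // vectorize and classify an unseen document with the helpers defined above
-    val newDocVec = vectorizeDocument("Hello world from Queens", dictionaryMap, dfCountMap)
-    classifyDocument(newDocVec)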
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/samsara/play-with-shell.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/samsara/play-with-shell.md 
b/website-old/docs/tutorials/samsara/play-with-shell.md
deleted file mode 100644
index a01f23c..0000000
--- a/website-old/docs/tutorials/samsara/play-with-shell.md
+++ /dev/null
@@ -1,199 +0,0 @@
----
-layout: tutorial
-title: Mahout Samsara In Core
-theme:
-    name: mahout2
----
-# Playing with Mahout's Spark Shell 
-
-This tutorial will show you how to play with Mahout's scala DSL for linear 
algebra and its Spark shell. **Please keep in mind that this code is still in a 
very early experimental stage**.
-
-_(Edited for 0.10.2)_
-
-## Intro
-
-We'll use an excerpt of a publicly available [dataset about cereals](http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html). The dataset gives the protein, fat, carbohydrate and sugar content (in milligrams) of a set of cereals, as well as a customer rating for each cereal. Our aim in this example is to fit a linear model that infers the customer rating from the ingredients.
-
-
-Name                    | protein | fat | carbo | sugars | rating
-:-----------------------|:--------|:----|:------|:-------|:---------
-Apple Cinnamon Cheerios | 2       | 2   | 10.5  | 10     | 29.509541
-Cap'n'Crunch            | 1       | 2   | 12    | 12     | 18.042851
-Cocoa Puffs             | 1       | 1   | 12    | 13     | 22.736446
-Froot Loops             | 2       | 1   | 11    | 13     | 32.207582
-Honey Graham Ohs        | 1       | 2   | 12    | 11     | 21.871292
-Wheaties Honey Gold     | 2       | 1   | 16    | 8      | 36.187559
-Cheerios                | 6       | 2   | 17    | 1      | 50.764999
-Clusters                | 3       | 2   | 13    | 7      | 40.400208
-Great Grains Pecan      | 3       | 3   | 13    | 4      | 45.811716
-
-
-## Installing Mahout & Spark on your local machine
-
-We describe how to do a quick toy setup of Spark & Mahout on your local 
machine, so that you can run this example and play with the shell. 
-
- 1. Download [Apache Spark 
1.6.2](http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz) and 
unpack the archive file
- 1. Change to the directory where you unpacked Spark and type ```sbt/sbt 
assembly``` to build it
- 1. Create a directory for Mahout somewhere on your machine, change into it and check out the master branch of Apache Mahout from GitHub: ```git clone https://github.com/apache/mahout mahout```
- 1. Change to the ```mahout``` directory and build Mahout using ```mvn -DskipTests clean install```
- 
-## Starting Mahout's Spark shell
-
- 1. Go to the directory where you unpacked Spark and type ```sbin/start-all.sh``` to start Spark locally
- 1. Open a browser and point it to [http://localhost:8080/](http://localhost:8080/) to check whether Spark started successfully. Copy the URL of the Spark master at the top of the page (it starts with **spark://**)
- 1. Define the following environment variables: <pre class="codehilite">export 
MAHOUT_HOME=[directory into which you checked out Mahout]
-export SPARK_HOME=[directory where you unpacked Spark]
-export MASTER=[url of the Spark master]
-</pre>
- 1. Finally, change to the directory where you unpacked Mahout and type ```bin/mahout spark-shell```; you should see the shell start up and get the prompt ```mahout> ```. Check the
-[FAQ](http://mahout.apache.org/users/sparkbindings/faq.html) for further troubleshooting.
-
-## Implementation
-
-We'll use the shell to interactively play with the data and incrementally 
implement a simple [linear 
regression](https://en.wikipedia.org/wiki/Linear_regression) algorithm. Let's 
first load the dataset. Usually, we wouldn't need Mahout unless we processed a 
large dataset stored in a distributed filesystem. But for the sake of this 
example, we'll use our tiny toy dataset and "pretend" it was too big to fit 
onto a single machine.
-
-*Note: You can incrementally follow the example by copy-and-pasting the code 
into your running Mahout shell.*
-
-Mahout's linear algebra DSL has an abstraction called *DistributedRowMatrix 
(DRM)* which models a matrix that is partitioned by rows and stored in the 
memory of a cluster of machines. We use ```dense()``` to create a dense 
in-memory matrix from our toy dataset and use ```drmParallelize``` to load it 
into the cluster, "mimicking" a large, partitioned dataset.
-
-<div class="codehilite"><pre>
-val drmData = drmParallelize(dense(
-  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
-  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
-  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
-  (2, 1, 11,   13, 32.207582),  // Froot Loops
-  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
-  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
-  (6, 2, 17,   1,  50.764999),  // Cheerios
-  (3, 2, 13,   7,  40.400208),  // Clusters
-  (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
-  numPartitions = 2);
-</pre></div>
-
-Have a look at this matrix. The first four columns represent the ingredients 
-(our features) and the last column (the rating) is the target variable for 
-our regression. [Linear 
regression](https://en.wikipedia.org/wiki/Linear_regression) 
-assumes that the **target variable** `\(\mathbf{y}\)` is generated by the 
-linear combination of **the feature matrix** `\(\mathbf{X}\)` with the 
-**parameter vector** `\(\boldsymbol{\beta}\)` plus the
- **noise** `\(\boldsymbol{\varepsilon}\)`, summarized in the formula 
-`\(\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\)`. 
-Our goal is to find an estimate of the parameter vector 
-`\(\boldsymbol{\beta}\)` that explains the data very well.
-
-As a first step, we extract `\(\mathbf{X}\)` and `\(\mathbf{y}\)` from our data matrix. We get *X* by slicing: we take all rows (denoted by ```::```) and the first four columns, which contain the ingredients in milligrams. Note that the result is again a DRM. The shell will not execute this code yet; it records the history of operations and defers execution until we actually access a result. **Mahout's DSL automatically optimizes and parallelizes all operations on DRMs and runs them on Apache Spark.**
-
-<div class="codehilite"><pre>
-val drmX = drmData(::, 0 until 4)
-</pre></div>
-
-Next, we extract the target variable vector *y*, the fifth column of the data 
matrix. We assume this one fits into our driver machine, so we fetch it into 
memory using ```collect```:
-
-<div class="codehilite"><pre>
-val y = drmData.collect(::, 4)
-</pre></div>
-
-Now we are ready to think about a mathematical way to estimate the parameter 
vector *β*. A simple textbook approach is [ordinary least squares 
(OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares), which minimizes 
the sum of residual squares between the true target variable and the prediction 
of the target variable. In OLS, there is even a closed form expression for 
estimating `\(\boldsymbol{\beta}\)` as 
-`\(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)`.
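-
-This closed form follows from minimizing the residual sum of squares `\(\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^{2}\)`: setting its gradient with respect to `\(\boldsymbol{\beta}\)` to zero yields the normal equations `\(\mathbf{X}^{\top}\mathbf{X}\boldsymbol{\beta}=\mathbf{X}^{\top}\mathbf{y}\)`, which we solve for `\(\boldsymbol{\beta}\)` in the steps below.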
-
-The first thing we compute for this is `\(\mathbf{X}^{\top}\mathbf{X}\)`. The code for doing this in Mahout's scala DSL maps directly to the mathematical formula: the operation ```.t()``` transposes a matrix and, analogous to R, ```%*%``` denotes matrix multiplication.
-
-<div class="codehilite"><pre>
-val drmXtX = drmX.t %*% drmX
-</pre></div>
-
-The same is true for computing `\(\mathbf{X}^{\top}\mathbf{y}\)`. We can simply type the math as scala expressions into the shell. Here, *X* lives in the cluster, while *y* is in the memory of the driver, and the result is again a DRM.
-<div class="codehilite"><pre>
-val drmXty = drmX.t %*% y
-</pre></div>
-
-We're nearly done. The next step is to fetch `\(\mathbf{X}^{\top}\mathbf{X}\)` and
-`\(\mathbf{X}^{\top}\mathbf{y}\)` into the memory of our driver machine (we are targeting
-feature matrices that are tall and skinny,
-so we can assume that `\(\mathbf{X}^{\top}\mathbf{X}\)` is small enough
-to fit in). Then, we provide them to an in-memory solver (Mahout provides
-an analog to R's ```solve()``` for that) which computes ```beta```, our
-OLS estimate of the parameter vector `\(\boldsymbol{\beta}\)`.
-
-<div class="codehilite"><pre>
-val XtX = drmXtX.collect
-val Xty = drmXty.collect(::, 0)
-
-val beta = solve(XtX, Xty)
-</pre></div>
-
-That's it! We have implemented a distributed linear regression algorithm
-on Apache Spark. I hope you agree that we didn't have to worry much about
-parallelization and distributed systems. The goal of Mahout's linear algebra
-DSL is to abstract away the ugliness of programming a distributed system
-as much as possible, while still retaining decent performance and
-scalability.
-
-We can now check how well our model fits its training data.
-First, we multiply the feature matrix `\(\mathbf{X}\)` by our estimate of
-`\(\boldsymbol{\beta}\)`. Then, we look at the difference (via the L2 norm) between
-the target variable `\(\mathbf{y}\)` and the fitted target variable:
-
-<div class="codehilite"><pre>
-val yFitted = (drmX %*% beta).collect(::, 0)
-(y - yFitted).norm(2)
-</pre></div>
-
-We hope this shows that Mahout's shell allows you to write algorithms interactively and incrementally. We have entered a lot of individual commands, one by one, until we got the desired results. We can now refactor a little by wrapping our statements into easy-to-use functions. Function definitions follow standard scala syntax.
-
-We put all the commands for ordinary least squares into a function ```ols```. 
-
-<div class="codehilite"><pre>
-def ols(drmX: DrmLike[Int], y: Vector) = 
-  solve(drmX.t %*% drmX, drmX.t %*% y)(::, 0)
-
-</pre></div>
-
-Note that the DSL declares an implicit `collect` if coercion rules require an in-core
-argument, so we can simply skip explicit `collect`s.
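-
-As a quick sanity check (using only objects defined earlier in this tutorial), calling the new function on the data we already have gives the same estimate as the step-by-step computation above:
-
-<div class="codehilite"><pre>
-val beta2 = ols(drmX, y)
-</pre></div>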
-
-Next, we define a function ```goodnessOfFit``` that tells how well a model 
fits the target variable:
-
-<div class="codehilite"><pre>
-def goodnessOfFit(drmX: DrmLike[Int], beta: Vector, y: Vector) = {
-  val fittedY = (drmX %*% beta).collect(::, 0)
-  (y - fittedY).norm(2)
-}
-</pre></div>
-
-So far we have left out an important aspect of a standard linear regression
-model: usually a constant bias (intercept) term is added to the model. Without
-it, our model is forced through the origin and we only learn a slope. An easy
-way to add such a bias term to our model is to add a
-column of ones to the feature matrix `\(\mathbf{X}\)`.
-The corresponding weight in the parameter vector will then be the bias term.
-
-Here is how we add a bias column:
-
-<div class="codehilite"><pre>
-val drmXwithBiasColumn = drmX cbind 1
-</pre></div>
-
-Now we can give the newly created DRM ```drmXwithBiasColumn``` to our model 
fitting method ```ols``` and see how well the resulting model fits the training 
data with ```goodnessOfFit```. You should see a large improvement in the result.
-
-<div class="codehilite"><pre>
-val betaWithBiasTerm = ols(drmXwithBiasColumn, y)
-goodnessOfFit(drmXwithBiasColumn, betaWithBiasTerm, y)
-</pre></div>
-
-As a further optimization, we can make use of the DSL's caching functionality. We use ```drmXwithBiasColumn``` repeatedly as input to computations, so it might be beneficial to cache it in memory. This is achieved by calling ```checkpoint()```. At the end, we remove it from the cache with ```uncache()```:
-
-<div class="codehilite"><pre>
-val cachedDrmX = drmXwithBiasColumn.checkpoint()
-
-val betaWithBiasTerm = ols(cachedDrmX, y)
-val goodness = goodnessOfFit(cachedDrmX, betaWithBiasTerm, y)
-
-cachedDrmX.uncache()
-
-goodness
-</pre></div>
-
-
-Liked what you saw? Check out Mahout's overview of the [Scala and Spark
-bindings]({{ BASE_PATH }}/distributed/spark-bindings).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/samsara/playing-with-samsara-flink-batch.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/samsara/playing-with-samsara-flink-batch.md 
b/website-old/docs/tutorials/samsara/playing-with-samsara-flink-batch.md
deleted file mode 100644
index 752f01c..0000000
--- a/website-old/docs/tutorials/samsara/playing-with-samsara-flink-batch.md
+++ /dev/null
@@ -1,111 +0,0 @@
----
-layout: tutorial
-title: 
-theme:
-   name: retro-mahout
----
-
-## Getting Started 
-
-To get started, add the following dependency to the pom:
-
-    <dependency>
-      <groupId>org.apache.mahout</groupId>
-      <artifactId>mahout-flink_2.10</artifactId>
-      <version>0.12.0</version>
-    </dependency>
-
-Here is how to use the Flink backend:
-
-       import org.apache.flink.api.scala._
-       import org.apache.mahout.math.drm._
-       import org.apache.mahout.math.drm.RLikeDrmOps._
-       import org.apache.mahout.flinkbindings._
-
-       object ReadCsvExample {
-
-         def main(args: Array[String]): Unit = {
-           val filePath = "path/to/the/input/file"
-
-           val env = ExecutionEnvironment.getExecutionEnvironment
-           implicit val ctx = new FlinkDistributedContext(env)
-
-           val drm = readCsv(filePath, delim = "\t", comment = "#")
-           val C = drm.t %*% drm
-           println(C.collect)
-         }
-
-       }
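-
-Since the Samsara DSL is engine-agnostic, the same operations shown in the Spark shell tutorial should also run against the Flink context. The following is a minimal sketch (the `ParallelizeExample` object and its toy matrix are hypothetical, not part of the Flink bindings):
-
-       import org.apache.flink.api.scala._
-       import org.apache.mahout.math.scalabindings._
-       import org.apache.mahout.math.drm._
-       import org.apache.mahout.math.drm.RLikeDrmOps._
-       import org.apache.mahout.flinkbindings._
-
-       object ParallelizeExample {
-
-         def main(args: Array[String]): Unit = {
-           val env = ExecutionEnvironment.getExecutionEnvironment
-           implicit val ctx = new FlinkDistributedContext(env)
-
-           // toy in-core matrix, shipped to the Flink cluster as a DRM
-           val drmA = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)
-
-           // the optimizer plans and runs A' A on Flink
-           val AtA = drmA.t %*% drmA
-           println(AtA.collect)
-         }
-
-       }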
-
-## Current Status
-
-The umbrella JIRA for the Flink backend is [MAHOUT-1570](https://issues.apache.org/jira/browse/MAHOUT-1570), which has been fully implemented.
-
-### Implemented
-
-* [MAHOUT-1701](https://issues.apache.org/jira/browse/MAHOUT-1701) Mahout DSL 
for Flink: implement AtB ABt and AtA operators
-* [MAHOUT-1702](https://issues.apache.org/jira/browse/MAHOUT-1702) implement 
element-wise operators (like `A + 2` or `A + B`) 
-* [MAHOUT-1703](https://issues.apache.org/jira/browse/MAHOUT-1703) implement 
`cbind` and `rbind`
-* [MAHOUT-1709](https://issues.apache.org/jira/browse/MAHOUT-1709) implement 
slicing (like `A(1 to 10, ::)`)
-* [MAHOUT-1710](https://issues.apache.org/jira/browse/MAHOUT-1710) implement 
right in-core matrix multiplication (`A %*% B` when `B` is in-core) 
-* [MAHOUT-1711](https://issues.apache.org/jira/browse/MAHOUT-1711) implement 
broadcasting
-* [MAHOUT-1712](https://issues.apache.org/jira/browse/MAHOUT-1712) implement 
operators `At`, `Ax`, `Atx` - `Ax` and `At` are implemented
-* [MAHOUT-1734](https://issues.apache.org/jira/browse/MAHOUT-1734) implement 
I/O - should be able to read results of Flink bindings
-* [MAHOUT-1747](https://issues.apache.org/jira/browse/MAHOUT-1747) add support 
for different types of indexes (String, long, etc) - now supports `Int`, `Long` 
and `String`
-* [MAHOUT-1748](https://issues.apache.org/jira/browse/MAHOUT-1748) switch to 
Flink Scala API 
-* [MAHOUT-1749](https://issues.apache.org/jira/browse/MAHOUT-1749) Implement 
`Atx`
-* [MAHOUT-1750](https://issues.apache.org/jira/browse/MAHOUT-1750) Implement 
`ABt`
-* [MAHOUT-1751](https://issues.apache.org/jira/browse/MAHOUT-1751) Implement 
`AtA` 
-* [MAHOUT-1755](https://issues.apache.org/jira/browse/MAHOUT-1755) Flush 
intermediate results to FS - Flink, unlike Spark, does not store intermediate 
results in memory.
-* [MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764) Add 
standard backend tests for Flink
-* [MAHOUT-1765](https://issues.apache.org/jira/browse/MAHOUT-1765) Add 
documentation about Flink backend
-* [MAHOUT-1776](https://issues.apache.org/jira/browse/MAHOUT-1776) Refactor 
common Engine agnostic classes to Math-Scala module
-* [MAHOUT-1777](https://issues.apache.org/jira/browse/MAHOUT-1777) move 
HDFSUtil classes into the HDFS module
-* [MAHOUT-1804](https://issues.apache.org/jira/browse/MAHOUT-1804) Implement 
drmParallelizeWithRowLabels(..) in Flink
-* [MAHOUT-1805](https://issues.apache.org/jira/browse/MAHOUT-1805) Implement 
allReduceBlock(..) in Flink bindings
-* [MAHOUT-1809](https://issues.apache.org/jira/browse/MAHOUT-1809) Failing tests in flink-bindings: dals and dspca
-* [MAHOUT-1810](https://issues.apache.org/jira/browse/MAHOUT-1810) Failing 
test in flink-bindings: A + B Identically partitioned (mapBlock Checkpointing 
issue)
-* [MAHOUT-1812](https://issues.apache.org/jira/browse/MAHOUT-1812) Implement 
drmParallelizeWithEmptyLong(..) in flink bindings
-* [MAHOUT-1814](https://issues.apache.org/jira/browse/MAHOUT-1814) Implement 
drm2intKeyed in flink bindings
-* [MAHOUT-1815](https://issues.apache.org/jira/browse/MAHOUT-1815) 
dsqDist(X,Y) and dsqDist(X) failing in flink tests
-* [MAHOUT-1816](https://issues.apache.org/jira/browse/MAHOUT-1816) Implement 
newRowCardinality in CheckpointedFlinkDrm
-* [MAHOUT-1817](https://issues.apache.org/jira/browse/MAHOUT-1817) Implement 
caching in Flink Bindings
-* [MAHOUT-1818](https://issues.apache.org/jira/browse/MAHOUT-1818) dals test 
failing in Flink Bindings
-* [MAHOUT-1819](https://issues.apache.org/jira/browse/MAHOUT-1819) Set the 
default Parallelism for Flink execution in FlinkDistributedContext
-* [MAHOUT-1820](https://issues.apache.org/jira/browse/MAHOUT-1820) Add a 
method to generate Tuple<PartitionId, Partition elements count>> to support 
Flink backend
-* [MAHOUT-1821](https://issues.apache.org/jira/browse/MAHOUT-1821) Use a 
mahout-flink-conf.yaml configuration file for Mahout specific Flink 
configuration
-* [MAHOUT-1822](https://issues.apache.org/jira/browse/MAHOUT-1822) Update 
NOTICE.txt, License.txt to add Apache Flink
-* [MAHOUT-1823](https://issues.apache.org/jira/browse/MAHOUT-1823) Modify 
MahoutFlinkTestSuite to implement FlinkTestBase
-* [MAHOUT-1824](https://issues.apache.org/jira/browse/MAHOUT-1824) Optimize 
FlinkOpAtA to use upper triangular matrices
-* [MAHOUT-1825](https://issues.apache.org/jira/browse/MAHOUT-1825) Add List of 
Flink algorithms to Mahout wiki page
-
-### Tests 
-
-There is a set of standard tests that all engines should pass (see 
[MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764)).  
-
-* `DistributedDecompositionsSuite` 
-* `DrmLikeOpsSuite` 
-* `DrmLikeSuite` 
-* `RLikeDrmOpsSuite` 
-
-
-In addition, there are Flink-backend-specific tests, e.g.:
-
-* `DrmLikeOpsSuite` for operations like `norm`, `rowSums`, `rowMeans`
-* `RLikeOpsSuite` for basic LA like `A.t %*% A`, `A.t %*% x`, etc
-* `LATestSuite` tests for specific operators like `AtB`, `Ax`, etc
-* `UseCasesSuite` has more complex examples, like power iteration, ridge 
regression, etc
-
-## Environment 
-
-For development, the minimal supported configuration is:
-
-* [JDK 1.7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html)
-* Scala 2.10
-
-When using Mahout, please import the following modules:
-
-* `mahout-math`
-* `mahout-math-scala`
-* `mahout-flink_2.10`
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/samsara/spark-naive-bayes.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/samsara/spark-naive-bayes.md 
b/website-old/docs/tutorials/samsara/spark-naive-bayes.md
deleted file mode 100644
index 3c24fff..0000000
--- a/website-old/docs/tutorials/samsara/spark-naive-bayes.md
+++ /dev/null
@@ -1,132 +0,0 @@
----
-layout: tutorial
-title: Spark Naive Bayes
-theme:
-    name: retro-mahout
----
-
-# Spark Naive Bayes
-
-
-## Intro
-
-Mahout currently has two flavors of Naive Bayes.  The first is standard 
Multinomial Naive Bayes. The second is an implementation of Transformed 
Weight-normalized Complement Naive Bayes as introduced by Rennie et al. 
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to 
the former as Bayes and the latter as CBayes.
-
-Where Bayes has long been a standard in text classification, CBayes is an 
extension of Bayes that performs particularly well on datasets with skewed 
classes and has been shown to be competitive with algorithms of higher 
complexity such as Support Vector Machines. 
-
-
-## Implementations
-The Mahout `math-scala` library has an implementation of both Bayes and CBayes which is further optimized in the `spark` module. Currently the Spark-optimized version provides CLI drivers for training and testing. Mahout Spark Naive Bayes models can also be trained, tested and saved to the filesystem from the Mahout Spark Shell.
-
-## Preprocessing and Algorithm
-
-As described in 
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive 
Bayes is broken down into the following steps (assignments are over all 
possible index values):  
-
-- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; 
`\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
-- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
-- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary; 
let `\(\alpha=\sum_i{\alpha_i}\)`. 
-- **Preprocessing**(via seq2Sparse) TF-IDF transformation and L2 length 
normalization of `\(\vec{d}\)`
-    1. `\(d_{ij} = \sqrt{d_{ij}}\)` 
-    2. `\(d_{ij} = 
d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)` 
-    3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)` 
-- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights 
`\(w_{ci}\)` as:
-    1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)`
-    2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
-- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights 
`\(w_{ci}\)` as:
-    1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq 
c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
-    2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
-    3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
-- **Label Assignment/Testing:**
-    1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)` in the test document.
-    2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i 
w_{ci}\)`
-
-As we can see, the main difference between Bayes and CBayes is the weight calculation step.  Where Bayes weights terms more heavily based on the likelihood that they belong to class `\(c\)`, CBayes gives high weight to terms that are unlikely to appear in the other classes (the complement of `\(c\)`).
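-
-The label assignment rule itself is just an argmax over weighted term sums. As an illustrative sketch in plain Scala (not the Mahout API), with hypothetical per-class weight arrays and a term-count array for the test document:
-
-    // illustrative only: argmax over classes of sum_i t_i * w_ci
-    def assignLabel(weights: Map[String, Array[Double]], termCounts: Array[Double]): String =
-      weights.maxBy { case (_, w) =>
-        w.zip(termCounts).map { case (wi, ti) => wi * ti }.sum
-      }._1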
-
-## Running from the command line
-
-Mahout provides CLI drivers for all above steps.  Here we will give a simple 
overview of Mahout CLI commands used to preprocess the data, train the model 
and assign labels to the training set. An [example 
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
 is given for the full process from data acquisition through classification of 
the classic [20 Newsgroups 
corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html). 
 
-
-- **Preprocessing:**
-For a set of SequenceFile-formatted documents in PATH_TO_SEQUENCE_FILES, the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformation (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
-
-        $ mahout seq2sparse 
-          -i ${PATH_TO_SEQUENCE_FILES} 
-          -o ${PATH_TO_TFIDF_VECTORS} 
-          -nv 
-          -n 2
-          -wt tfidf
-
-- **Training:**
-The model is then trained using `mahout spark-trainnb`.  The default is to 
train a Bayes model. The -c option is given to train a CBayes model:
-
-        $ mahout spark-trainnb
-          -i ${PATH_TO_TFIDF_VECTORS} 
-          -o ${PATH_TO_MODEL}
-          -ow 
-          -c
-
-- **Label Assignment/Testing:**
-Classification and testing on a holdout set can then be performed via `mahout 
spark-testnb`. Again, the -c option indicates that the model is CBayes:
-
-        $ mahout spark-testnb 
-          -i ${PATH_TO_TFIDF_TEST_VECTORS}
-          -m ${PATH_TO_MODEL} 
-          -c 
-
-## Command line options
-
-- **Preprocessing:** *note: still reliant on MapReduce seq2sparse* 
-  
-  Only the parameters relevant to Bayes/CBayes as detailed above are shown. Several other transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes.  For a full list of `mahout seq2sparse` options see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page.
-
-        $ mahout seq2sparse
-          --output (-o) output             The directory pathname for output.
-          --input (-i) input               Path to job input directory.
-          --weight (-wt) weight            The kind of weight to use. Currently TF
-                                               or TFIDF. Default: TFIDF
-          --norm (-n) norm                 The norm to use, expressed as either a
-                                               float or "INF" if you want to use the
-                                               Infinite norm.  Must be greater or equal
-                                               to 0.  The default is not to normalize
-          --overwrite (-ow)                If set, overwrite the output directory
-          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
-                                               be SequentialAccessVectors. If set true
-                                               else false
-          --namedVector (-nv)              (Optional) Whether output vectors should
-                                               be NamedVectors. If set true else false
-
-- **Training:**
-
-        $ mahout spark-trainnb
-          --input (-i) input               Path to job input directory.
-          --output (-o) output             The directory pathname for output.
-          --trainComplementary (-c)        Train complementary? Default is false.
-          --master (-ma)                   Spark Master URL (optional). Default: "local".
-                                               Note that you can specify the number of
-                                               cores to get a performance improvement,
-                                               for example "local[4]"
-          --help (-h)                      Print out help
-
-- **Testing:**
-
-        $ mahout spark-testnb
-          --input (-i) input               Path to job input directory.
-          --model (-m) model               The path to the model built during training.
-          --testComplementary (-c)         Test complementary? Default is false.
-          --master (-ma)                   Spark Master URL (optional). Default: "local".
-                                               Note that you can specify the number of
-                                               cores to get a performance improvement,
-                                               for example "local[4]"
-          --help (-h)                      Print out help
-
-## Examples
-1. [20 Newsgroups 
classification](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
-2. [Document classification with Naive Bayes in the Mahout 
shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
-        
- 
-## References
-
-[1]: Jason D. M. Rennie, Lawerence Shih, Jamie Teevan, David Karger (2003). 
[Tackling the Poor Assumptions of Naive Bayes Text 
Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). 
Proceedings of the Twentieth International Conference on Machine Learning 
(ICML-2003).
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/front/404.html
----------------------------------------------------------------------
diff --git a/website-old/front/404.html b/website-old/front/404.html
deleted file mode 100755
index 6904bcd..0000000
--- a/website-old/front/404.html
+++ /dev/null
@@ -1 +0,0 @@
-Sorry, this page does not exist =(

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/front/Gemfile
----------------------------------------------------------------------
diff --git a/website-old/front/Gemfile b/website-old/front/Gemfile
deleted file mode 100755
index 301d29c..0000000
--- a/website-old/front/Gemfile
+++ /dev/null
@@ -1,5 +0,0 @@
-source "https://rubygems.org";
-
-gem "jekyll", "~> 3.1"
-gem "jekyll-sitemap"
-gem "pygments.rb"

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/front/README.md
----------------------------------------------------------------------
diff --git a/website-old/front/README.md b/website-old/front/README.md
deleted file mode 100755
index 2283543..0000000
--- a/website-old/front/README.md
+++ /dev/null
@@ -1,5 +0,0 @@
-### Landing Page
-
-Still needs a lot of work...
-
-![landing](screenshots/landing.png)
\ No newline at end of file
