http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/clustering/streaming-k-means.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/clustering/streaming-k-means.md b/website-old/docs/algorithms/map-reduce/clustering/streaming-k-means.md
new file mode 100644
index 0000000..599921f
--- /dev/null
+++ b/website-old/docs/algorithms/map-reduce/clustering/streaming-k-means.md
@@ -0,0 +1,174 @@
+---
+layout: algorithm
+title: (Deprecated) Streaming KMeans
+theme:
+  name: retro-mahout
+---
+
+# *StreamingKMeans* algorithm
+
+The *StreamingKMeans* algorithm is a variant of Algorithm 1 from [Shindler et al][1] and consists of two steps:
+
+ 1. Streaming step
+ 2. BallKMeans step.
+
+The streaming step is a randomized algorithm that makes one pass through the data and
+produces as many centroids as it determines is optimal. This step can be viewed as
+a preparatory dimensionality reduction. If the size of the data stream is *n* and the
+expected number of clusters is *k*, the streaming step will produce roughly *k\*log(n)*
+clusters that will be passed on to the BallKMeans step, which will further reduce the
+number of clusters down to *k*. BallKMeans is a randomized Lloyd-type algorithm that
+has been studied in detail; see [Ostrovsky et al][2].
+
+## Streaming step
+
+---
+
+### Overview
+
+The streaming step is a derivative of the streaming
+portion of Algorithm 1 in [Shindler et al][1]. The main difference between the two is that
+Algorithm 1 of [Shindler et al][1] assumes
+knowledge of the size of the data stream and uses it to set a key parameter
+for the algorithm. More precisely, the initial *distanceCutoff* (defined below), which is
+denoted by *f* in [Shindler et al][1], is set to *1/(k(1+log(n)))*. The *distanceCutoff* influences the number of clusters that the algorithm
+will produce.
+In contrast, the Mahout implementation does not require knowledge of the size of the
+data stream. Instead, it dynamically re-evaluates the parameters that depend on the size
+of the data stream at runtime as more and more data is processed. In particular,
+the parameter *numClusters* (defined below) changes its value as the data is processed.
+
+###Parameters
+
+ - **numClusters** (int): Conceptually, *numClusters* represents the algorithm's guess at the optimal
+number of clusters it is shooting for. In particular, *numClusters* will increase at run
+time as more and more data is processed. Note that *numClusters* is not the number of clusters that the algorithm will produce. Also, *numClusters* should not be set to the final number of clusters that we expect to receive as the output of *StreamingKMeans*.
+ - **distanceCutoff** (double): a parameter representing the value of the distance between a point and
+its closest centroid after which
+the new point will definitely be assigned to a new cluster. *distanceCutoff* can be thought
+of as an estimate of the variable *f* from Shindler et al. The default initial value for
+*distanceCutoff* is *1.0/numClusters* and *distanceCutoff* grows as a geometric progression with
+common ratio *beta* (see below).
+ - **beta** (double): a constant parameter that controls the growth of *distanceCutoff*. If the initial setting of *distanceCutoff* is *d0*, *distanceCutoff* will grow as the geometric progression with initial term *d0* and common ratio *beta*. The default value for *beta* is 1.3.
+ - **clusterLogFactor** (double): a constant parameter such that *clusterLogFactor* *log(numProcessedPoints)* is the runtime estimate of the number of clusters to be produced by the streaming step. If the final number of clusters (that we expect *StreamingKMeans* to output) is *k*, *clusterLogFactor* can be set to *k*. + - **clusterOvershoot** (double): a constant multiplicative slack factor that slows down the collapsing of clusters. The default value is 2. + + +###Algorithm + +The algorithm processes the data one-by-one and makes only one pass through the data. +The first point from the data stream will form the centroid of the first cluster (this designation may change as more points are processed). Suppose there are *r* clusters at one point and a new point *p* is being processed. The new point can either be added to one of the existing *r* clusters or become a new cluster. To decide: + + - let *c* be the closest cluster to point *p* + - let *d* be the distance between *c* and *p* + - if *d > distanceCutoff*, create a new cluster from *p* (*p* is too far away from the clusters to be part of any one of them) + - else (*d <= distanceCutoff*), create a new cluster with probability *d / distanceCutoff* (the probability of creating a new cluster increases as *d* increases). + +There will be either *r* or *r+1* clusters after processing a new point. + +As the number of clusters increases, it will go over the *clusterOvershoot \* numClusters* limit (*numClusters* represents a recommendation for the number of clusters that the streaming step should aim for and *clusterOvershoot* is the slack). To decrease the number of clusters the existing clusters +are treated as data points and are re-clustered (collapsed). This tends to make the number of clusters go down. If the number of clusters is still too high, *distanceCutoff* is increased. + +## BallKMeans step +--- +### Overview +The algorithm is a Lloyd-type algorithm that takes a set of weighted vectors and returns k centroids, see [Ostrovsky et al][2] for details. The algorithm has two stages: + + 1. Seeding + 2. Ball k-means + +The seeding stage is an initial guess of where the centroids should be. The initial guess is improved using the ball k-means stage. + +### Parameters + +* **numClusters** (int): the number k of centroids to return. The algorithm will return exactly this number of centroids. + +* **maxNumIterations** (int): After seeding, the iterative clustering procedure will be run at most *maxNumIterations* times. 1 or 2 iterations are recommended. Increasing beyond this will increase the accuracy of the result at the expense of runtime. Each successive iteration yields diminishing returns in lowering the cost. + +* **trimFraction** (double): Outliers are ignored when computing the center of mass for a cluster. For any datapoint *x*, let *c* be the nearest centroid. Let *d* be the minimum distance from *c* to another centroid. If the distance from *x* to *c* is greater than *trimFraction \* d*, then *x* is considered an outlier during that iteration of ball k-means. The default is 9/10. In [Ostrovsky et al][2], the authors use *trimFraction* = 1/3, but this does not mean that 1/3 is optimal in practice. + +* **kMeansPlusPlusInit** (boolean): If true, the seeding method is k-means++. If false, the seeding method is to select points uniformly at random. The default is true. + +* **correctWeights** (boolean): If *correctWeights* is true, outliers will be considered when calculating the weight of centroids. The default is true. 
Note that outliers are not considered when calculating the position of centroids. + +* **testProbability** (double): If *testProbability* is *p* (0 < *p* < 1), the data (of size n) is partitioned into a test set (of size *p\*n*) and a training set (of size *(1-p)\*n*). If 0, no test set is created (the entire data set is used for both training and testing). The default is 0.1 if *numRuns* > 1. If *numRuns* = 1, then no test set should be created (since it is only used to compare the cost between different runs). + +* **numRuns** (int): This is the number of runs to perform. The solution of lowest cost is returned. The default is 1 run. + +###Algorithm +The algorithm can be instructed to take multiple independent runs (using the *numRuns* parameter) and the algorithm will select the best solution (i.e., the one with the lowest cost). In practice, one run is sufficient to find a good solution. + +Each run operates as follows: a seeding procedure is used to select k centroids, and then ball k-means is run iteratively to refine the solution. + +The seeding procedure can be set to either 'uniformly at random' or 'k-means++' using *kMeansPlusPlusInit* boolean variable. Seeding with k-means++ involves more computation but offers better results in practice. + +Each iteration of ball k-means runs as follows: + +1. Clusters are formed by assigning each datapoint to the nearest centroid +2. The centers of mass of the trimmed clusters (see *trimFraction* parameter above) become the new centroids + +The data may be partitioned into a test set and a training set (see *testProbability*). The seeding procedure and ball k-means run on the training set. The cost is computed on the test set. + + +##Usage of *StreamingKMeans* + + bin/mahout streamingkmeans + -i <input> + -o <output> + -ow + -k <k> + -km <estimatedNumMapClusters> + -e <estimatedDistanceCutoff> + -mi <maxNumIterations> + -tf <trimFraction> + -ri + -iw + -testp <testProbability> + -nbkm <numBallKMeansRuns> + -dm <distanceMeasure> + -sc <searcherClass> + -np <numProjections> + -s <searchSize> + -rskm + -xm <method> + -h + --tempDir <tempDir> + --startPhase <startPhase> + --endPhase <endPhase> + + +###Details on Job-Specific Options: + + * `--input (-i) <input>`: Path to job input directory. + * `--output (-o) <output>`: The directory pathname for output. + * `--overwrite (-ow)`: If present, overwrite the output directory before running job. + * `--numClusters (-k) <k>`: The k in k-Means. Approximately this many clusters will be generated. + * `--estimatedNumMapClusters (-km) <estimatedNumMapClusters>`: The estimated number of clusters to use for the Map phase of the job when running StreamingKMeans. This should be around k \* log(n), where k is the final number of clusters and n is the total number of data points to cluster. + * `--estimatedDistanceCutoff (-e) <estimatedDistanceCutoff>`: The initial estimated distance cutoff between two points for forming new clusters. If no value is given, it's estimated from the data set + * `--maxNumIterations (-mi) <maxNumIterations>`: The maximum number of iterations to run for the BallKMeans algorithm used by the reducer. If no value is given, defaults to 10. + * `--trimFraction (-tf) <trimFraction>`: The 'ball' aspect of ball k-means means that only the closest points to the centroid will actually be used for updating. The fraction of the points to be used is those points whose distance to the center is within trimFraction \* distance to the closest other center. If no value is given, defaults to 0.9. 
+ * `--randomInit` (`-ri`) Whether to use k-means++ initialization or random initialization of the seed centroids. Essentially, k-means++ provides better clusters, but takes longer, whereas random initialization takes less time, but produces worse clusters, and tends to fail more often and needs multiple runs to compare to k-means++. If set, uses the random initialization. + * `--ignoreWeights (-iw)`: Whether to correct the weights of the centroids after the clustering is done. The weights end up being wrong because of the trimFraction and possible train/test splits. In some cases, especially in a pipeline, having an accurate count of the weights is useful. If set, ignores the final weights. + * `--testProbability (-testp) <testProbability>`: A double value between 0 and 1 that represents the percentage of points to be used for 'testing' different clustering runs in the final BallKMeans step. If no value is given, defaults to 0.1 + * `--numBallKMeansRuns (-nbkm) <numBallKMeansRuns>`: Number of BallKMeans runs to use at the end to try to cluster the points. If no value is given, defaults to 4 + * `--distanceMeasure (-dm) <distanceMeasure>`: The classname of the DistanceMeasure. Default is SquaredEuclidean. + * `--searcherClass (-sc) <searcherClass>`: The type of searcher to be used when performing nearest neighbor searches. Defaults to ProjectionSearch. + * `--numProjections (-np) <numProjections>`: The number of projections considered in estimating the distances between vectors. Only used when the distance measure requested is either ProjectionSearch or FastProjectionSearch. If no value is given, defaults to 3. + * `--searchSize (-s) <searchSize>`: In more efficient searches (non BruteSearch), not all distances are calculated for determining the nearest neighbors. The number of elements whose distances from the query vector is actually computer is proportional to searchSize. If no value is given, defaults to 1. + * `--reduceStreamingKMeans (-rskm)`: There might be too many intermediate clusters from the mapper to fit into memory, so the reducer can run another pass of StreamingKMeans to collapse them down to a fewer clusters. + * `--method (-xm)` method The execution method to use: sequential or mapreduce. Default is mapreduce. + * `-- help (-h)`: Print out help + * `--tempDir <tempDir>`: Intermediate output directory. + * `--startPhase <startPhase>` First phase to run. + * `--endPhase <endPhase>` Last phase to run. + + +##References + +1. [M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large Datasets][1] +2. [R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means Problem][2] + + +[1]: http://nips.cc/Conferences/2011/Program/event.php?ID=2989 "M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large Datasets" + +[2]: http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf "R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for the k-means Problem"
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/map-reduce/index.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/map-reduce/index.md b/website-old/docs/algorithms/map-reduce/index.md new file mode 100644 index 0000000..14d099a --- /dev/null +++ b/website-old/docs/algorithms/map-reduce/index.md @@ -0,0 +1,42 @@ +--- +layout: algorithm +title: (Deprecated) Deprecated Map Reduce Algorithms +theme: + name: mahout2 +--- + +### Classification + +[Bayesian](classification/bayesian.html) + +[Class Discovery](classification/class-discovery.html) + +[Classifying Your Data](classification/classifyingyourdata.html) + +[Collocations](classification/collocations.html) + +[Gaussian Discriminative Analysis](classification/gaussian-discriminative-analysis.html) + +[Hidden Markov Models](classification/hidden-markov-models.html) + +[Independent Component Analysis](classification/independent-component-analysis.html) + +[Locally Weighted Linear Regression](classification/locally-weighted-linear-regression.html) + +[Logistic Regression](classification/logistic-regression.html) + +[Mahout Collections](classification/mahout-collections.html) + +[Multilayer Perceptron](classification/mlp.html) + +[Naive Bayes](classification/naivebayes.html) + +[Neural Network](classification/neural-network.html) + +[Partial Implementation](classification/partial-implementation.html) + +[Random Forrests](classification/random-forrests.html) + +[Restricted Boltzman Machines](classification/restricted-boltzman-machines.html) + +[Support Vector Machines](classification/support-vector-machines.html) http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/preprocessors/AsFactor.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/preprocessors/AsFactor.md b/website-old/docs/algorithms/preprocessors/AsFactor.md new file mode 100644 index 0000000..255f982 --- /dev/null +++ b/website-old/docs/algorithms/preprocessors/AsFactor.md @@ -0,0 +1,35 @@ +--- +layout: algorithm +title: AsFactor +theme: + name: mahout2 +--- + + +### About + +The `AsFactor` preprocessor is used to turn the integer values of the columns into sparse vectors where the value is 1 + at the index that corresponds to the 'category' of that column. This is also known as "One Hot Encoding" in many other + packages. + + +### Parameters + +`AsFactor` takes no parameters. + +### Example + +```scala +import org.apache.mahout.math.algorithms.preprocessing.AsFactor + +val A = drmParallelize(dense( + (3, 2, 1, 2), + (0, 0, 0, 0), + (1, 1, 1, 1)), numPartitions = 2) + +// 0 -> 2, 3 -> 5, 6 -> 9 +val factorizer: AsFactorModel = new AsFactor().fit(A) + +val factoredA = factorizer.transform(A) +``` + http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/preprocessors/MeanCenter.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/preprocessors/MeanCenter.md b/website-old/docs/algorithms/preprocessors/MeanCenter.md new file mode 100644 index 0000000..771b6a3 --- /dev/null +++ b/website-old/docs/algorithms/preprocessors/MeanCenter.md @@ -0,0 +1,30 @@ +--- +layout: algorithm +title: MeanCenter +theme: + name: mahout2 +--- + +### About + +`MeanCenter` centers values about the column mean. 
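+
+Concretely, for a column \(j\) with mean \(\bar{x}_{j}\), each entry is transformed as
+
+<center>\(x'_{ij} = x_{ij} - \bar{x}_{j}\)</center>
+
+so that every column of the output has mean zero. (This is the standard definition of mean centering, stated here for clarity; the formula is not excerpted from the Mahout source.)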
+
+### Parameters
+
+### Example
+
+```scala
+import org.apache.mahout.math.algorithms.preprocessing.MeanCenter
+
+val A = drmParallelize(dense(
+      (1, 1, -2),
+      (2, 5, 2),
+      (3, 9, 0)), numPartitions = 2)
+
+val scaler: MeanCenterModel = new MeanCenter().fit(A)
+
+val centeredA = scaler.transform(A)
+```
+
+
+
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/preprocessors/StandardScaler.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/preprocessors/StandardScaler.md b/website-old/docs/algorithms/preprocessors/StandardScaler.md
new file mode 100644
index 0000000..5b33709
--- /dev/null
+++ b/website-old/docs/algorithms/preprocessors/StandardScaler.md
@@ -0,0 +1,44 @@
+---
+layout: algorithm
+title: StandardScaler
+theme:
+  name: mahout2
+---
+
+### About
+
+`StandardScaler` centers the values of each column about the column mean, and scales them to unit variance.
+
+#### Relation to the `scale` function in R-base
+The `StandardScaler` is the equivalent of the R-base function [`scale`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html) with
+one notable tweak. R's `scale` function (indeed all of R) calculates standard deviation with 1 degree of freedom; Mahout
+(like many other statistical packages aimed at larger data sets) does not make this adjustment. On larger datasets the difference
+is trivial, but when testing the function on smaller datasets the practitioner may be confused by the discrepancy.
+
+To verify this function against R on an arbitrary matrix, use the following form in R to "undo" the degrees of freedom correction.
+```R
+N <- nrow(x)
+scale(x, scale = apply(x, 2, sd) * sqrt((N - 1) / N))
+```
+
+### Parameters
+
+`StandardScaler` takes no parameters at this time.
+
+### Example
+
+
+```scala
+import org.apache.mahout.math.algorithms.preprocessing.StandardScaler
+
+val A = drmParallelize(dense(
+      (1, 1, 5),
+      (2, 5, -15),
+      (3, 9, -2)), numPartitions = 2)
+
+val scaler: StandardScalerModel = new StandardScaler().fit(A)
+
+val scaledA = scaler.transform(A)
+```
+
+
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/preprocessors/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/preprocessors/index.md b/website-old/docs/algorithms/preprocessors/index.md
new file mode 100644
index 0000000..204e06c
--- /dev/null
+++ b/website-old/docs/algorithms/preprocessors/index.md
@@ -0,0 +1,10 @@
+---
+layout: algorithm
+title: Preprocessors
+theme:
+  name: mahout2
+---
+
+[AsFactor](AsFactor.html) - For "one-hot-encoding"
+
+[StandardScaler](StandardScaler.html) - For mean centering and unit variance
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/reccomenders/cco.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/reccomenders/cco.md b/website-old/docs/algorithms/reccomenders/cco.md
new file mode 100644
index 0000000..4a6f3f8
--- /dev/null
+++ b/website-old/docs/algorithms/reccomenders/cco.md
@@ -0,0 +1,435 @@
+---
+layout: algorithm
+title: Intro to Cooccurrence Recommenders with Spark
+theme:
+  name: retro-mahout
+---
+
+Mahout provides several important building blocks for creating recommendations using Spark.
*spark-itemsimilarity* can +be used to create "other people also liked these things" type recommendations and paired with a search engine can +personalize recommendations for individual users. *spark-rowsimilarity* can provide non-personalized content based +recommendations and when paired with a search engine can be used to personalize content based recommendations. + +## References + +1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning) +2. A slide deck, which talks about mixing actions or other indicators: [Creating a Unified Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/) +3. Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/) +and [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/) +3. A post describing the loglikelihood ratio: [Surprise and Coinsidense](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html) LLR is used to reduce noise in the data while keeping the calculations O(n) complexity. + +Below are the command line jobs but the drivers and associated code can also be customized and accessed from the Scala APIs. + +## 1. spark-itemsimilarity +*spark-itemsimilarity* is the Spark counterpart of the of the Mahout mapreduce job called *itemsimilarity*. It takes in elements of interactions, which have userID, itemID, and optionally a value. It will produce one of more indicator matrices created by comparing every user's interactions with every other user. The indicator matrix is an item x item matrix where the values are log-likelihood ratio strengths. For the legacy mapreduce version, there were several possible similarity measures but these are being deprecated in favor of LLR because in practice it performs the best. + +Mahout's mapreduce version of itemsimilarity takes a text file that is expected to have user and item IDs that conform to +Mahout's ID requirements--they are non-negative integers that can be viewed as row and column numbers in a matrix. + +*spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence, in other words the Spark version will +account for multi-modal interactions and create indicator matrices allowing the use of much more data in +creating recommendations or similar item lists. People try to do this by mixing different actions and giving them weights. +For instance they might say an item-view is 0.2 of an item purchase. In practice this is often not helpful. Spark-itemsimilarity's +cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want +to recommend. + + + spark-itemsimilarity Mahout 1.0 + Usage: spark-itemsimilarity [options] + + Disconnected from the target VM, address: '127.0.0.1:64676', transport: 'socket' + Input, output options + -i <value> | --input <value> + Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required) + -i2 <value> | --input2 <value> + Secondary input path for cross-similarity calculation, same restrictions as "--input" (optional). Default: empty. + -o <value> | --output <value> + Path for output, any local or HDFS supported URI (required) + + Algorithm control options: + -mppu <value> | --maxPrefs <value> + Max number of preferences to consider per user (optional). 
Default: 500 + -m <value> | --maxSimilaritiesPerItem <value> + Limit the number of similarities per item to this number (optional). Default: 100 + + Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure. + + Input text file schema options: + -id <value> | --inDelim <value> + Input delimiter character (optional). Default: "[,\t]" + -f1 <value> | --filter1 <value> + String (or regex) whose presence indicates a datum for the primary item set (optional). Default: no filter, all data is used + -f2 <value> | --filter2 <value> + String (or regex) whose presence indicates a datum for the secondary item set (optional). If not present no secondary dataset is collected + -rc <value> | --rowIDColumn <value> + Column number (0 based Int) containing the row ID string (optional). Default: 0 + -ic <value> | --itemIDColumn <value> + Column number (0 based Int) containing the item ID string (optional). Default: 1 + -fc <value> | --filterColumn <value> + Column number (0 based Int) containing the filter string (optional). Default: -1 for no filter + + Using all defaults the input is expected of the form: "userID<tab>itemId" or "userID<tab>itemID<tab>any-text..." and all rows will be used + + File discovery options: + -r | --recursive + Searched the -i path recursively for files that match --filenamePattern (optional), Default: false + -fp <value> | --filenamePattern <value> + Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory + + Output text file schema options: + -rd <value> | --rowKeyDelim <value> + Separates the rowID key from the vector values list (optional). Default: "\t" + -cd <value> | --columnIdStrengthDelim <value> + Separates column IDs from their values in the vector values list (optional). Default: ":" + -td <value> | --elementDelim <value> + Separates vector element values in the values list (optional). Default: " " + -os | --omitStrength + Do not write the strength to the output files (optional), Default: false. + This option is used to output indexable data for creating a search engine recommender. + + Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..." + + Spark config options: + -ma <value> | --master <value> + Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]" + -sem <value> | --sparkExecutorMem <value> + Max Java heap available as "executor memory" on each node (optional). Default: 4g + -rs <value> | --randomSeed <value> + + -h | --help + prints this usage text + +This looks daunting but defaults to simple fairly sane values to take exactly the same input as legacy code and is pretty flexible. It allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files included can be specified with either a regex-style pattern or filename. The schema for the file is defined by column numbers, which map to the important bits of data including IDs and values. The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations. + +See ItemSimilarityDriver.scala in Mahout's spark module if you want to customize the code. + +### Defaults in the _**spark-itemsimilarity**_ CLI + +If all defaults are used the input can be as simple as: + + userID1,itemID1 + userID2,itemID2 + ... 
+ +With the command line: + + + bash$ mahout spark-itemsimilarity --input in-file --output out-dir + + +This will use the "local" Spark context and will output the standard text version of a DRM + + itemID1<tab>itemID2:value2<space>itemID10:value10... + +### <a name="multiple-actions">How To Use Multiple User Actions</a> + +Often we record various actions the user takes for later analytics. These can now be used to make recommendations. +The idea of a recommender is to recommend the action you want the user to make. For an ecom app this might be +a purchase action. It is usually not a good idea to just treat other actions the same as the action you want to recommend. +For instance a view of an item does not indicate the same intent as a purchase and if you just mixed the two together you +might even make worse recommendations. It is tempting though since there are so many more views than purchases. With *spark-itemsimilarity* +we can now use both actions. Mahout will use cross-action cooccurrence analysis to limit the views to ones that do predict purchases. +We do this by treating the primary action (purchase) as data for the indicator matrix and use the secondary action (view) +to calculate the cross-cooccurrence indicator matrix. + +*spark-itemsimilarity* can read separate actions from separate files or from a mixed action log by filtering certain lines. For a mixed +action log of the form: + + u1,purchase,iphone + u1,purchase,ipad + u2,purchase,nexus + u2,purchase,galaxy + u3,purchase,surface + u4,purchase,iphone + u4,purchase,galaxy + u1,view,iphone + u1,view,ipad + u1,view,nexus + u1,view,galaxy + u2,view,iphone + u2,view,ipad + u2,view,nexus + u2,view,galaxy + u3,view,surface + u3,view,nexus + u4,view,iphone + u4,view,ipad + u4,view,galaxy + +### Command Line + + +Use the following options: + + bash$ mahout spark-itemsimilarity \ + --input in-file \ # where to look for data + --output out-path \ # root dir for output + --master masterUrl \ # URL of the Spark master server + --filter1 purchase \ # word that flags input for the primary action + --filter2 view \ # word that flags input for the secondary action + --itemIDPosition 2 \ # column that has the item ID + --rowIDPosition 0 \ # column that has the user ID + --filterPosition 1 # column that has the filter word + + + +### Output + +The output of the job will be the standard text version of two Mahout DRMs. 
This is a case where we are calculating +cross-cooccurrence so a primary indicator matrix and cross-cooccurrence indicator matrix will be created + + out-path + |-- similarity-matrix - TDF part files + \-- cross-similarity-matrix - TDF part-files + +The indicator matrix will contain the lines: + + galaxy\tnexus:1.7260924347106847 + ipad\tiphone:1.7260924347106847 + nexus\tgalaxy:1.7260924347106847 + iphone\tipad:1.7260924347106847 + surface + +The cross-cooccurrence indicator matrix will contain: + + iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 + ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 + nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 + galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 + surface\tsurface:4.498681156950466 nexus:0.6795961471815897 + +**Note:** You can run this multiple times to use more than two actions or you can use the underlying +SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any number of cross-cooccurrence indicators. + +### Log File Input + +A common method of storing data is in log files. If they are written using some delimiter they can be consumed directly by spark-itemsimilarity. For instance input of the form: + + 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone + 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad + 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus + 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface + 2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone + 2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy + 2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface + 2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad + 2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy + +Can be parsed with the following CLI and run on the cluster producing the same output as the above example. + + bash$ mahout spark-itemsimilarity \ + --input in-file \ + --output out-path \ + --master spark://sparkmaster:4044 \ + --filter1 purchase \ + --filter2 view \ + --inDelim "\t" \ + --itemIDPosition 4 \ + --rowIDPosition 1 \ + --filterPosition 2 + +## 2. spark-rowsimilarity + +*spark-rowsimilarity* is the companion to *spark-itemsimilarity* the primary difference is that it takes a text file version of +a matrix of sparse vectors with optional application specific IDs and it finds similar rows rather than items (columns). Its use is +not limited to collaborative filtering. The input is in text-delimited form where there are three delimiters used. By +default it reads (rowID<tab>columnID1:strength1<space>columnID2:strength2...) 
Since this job only supports LLR similarity, + which does not use the input strengths, they may be omitted in the input. It writes +(rowID<tab>rowID1:strength1<space>rowID2:strength2...) +The output is sorted by strength descending. The output can be interpreted as a row ID from the primary input followed +by a list of the most similar rows. + +The command line interface is: + + spark-rowsimilarity Mahout 1.0 + Usage: spark-rowsimilarity [options] + + Input, output options + -i <value> | --input <value> + Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required) + -o <value> | --output <value> + Path for output, any local or HDFS supported URI (required) + + Algorithm control options: + -mo <value> | --maxObservations <value> + Max number of observations to consider per row (optional). Default: 500 + -m <value> | --maxSimilaritiesPerRow <value> + Limit the number of similarities per item to this number (optional). Default: 100 + + Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure. + Disconnected from the target VM, address: '127.0.0.1:49162', transport: 'socket' + + Output text file schema options: + -rd <value> | --rowKeyDelim <value> + Separates the rowID key from the vector values list (optional). Default: "\t" + -cd <value> | --columnIdStrengthDelim <value> + Separates column IDs from their values in the vector values list (optional). Default: ":" + -td <value> | --elementDelim <value> + Separates vector element values in the values list (optional). Default: " " + -os | --omitStrength + Do not write the strength to the output files (optional), Default: false. + This option is used to output indexable data for creating a search engine recommender. + + Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..." + + File discovery options: + -r | --recursive + Searched the -i path recursively for files that match --filenamePattern (optional), Default: false + -fp <value> | --filenamePattern <value> + Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory + + Spark config options: + -ma <value> | --master <value> + Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]" + -sem <value> | --sparkExecutorMem <value> + Max Java heap available as "executor memory" on each node (optional). Default: 4g + -rs <value> | --randomSeed <value> + + -h | --help + prints this usage text + +See RowSimilarityDriver.scala in Mahout's spark module if you want to customize the code. + +# 3. Using *spark-rowsimilarity* with Text Data + +Another use case for *spark-rowsimilarity* is in finding similar textual content. For instance given the tags associated with +a blog post, + which other posts have similar tags. In this case the columns are tags and the rows are posts. Since LLR is +the only similarity method supported this is not the optimal way to determine general "bag-of-words" document similarity. +LLR is used more as a quality filter than as a similarity measure. However *spark-rowsimilarity* will produce +lists of similar docs for every doc if input is docs with lists of terms. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents. 
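+
+As a sketch, once each document has been reduced to a row of tag or term tokens (docID<tab>list-of-terms), it can be run through the CLI described above; the file names here are placeholders:
+
+    bash$ mahout spark-rowsimilarity \
+        --input doc-terms-file \
+        --output doc-similarity-out-path \
+        --master spark://sparkmaster:4044 \
+        --omitStrength
+
+The output is then a list of the most similar documents for every document, in the same text-delimited form described at the start of this section.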
+ +# <a name="unified-recommender">4. Creating a Unified Recommender</a> + +Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a unified cooccurrence and content based + recommender that can be used in both or either mode depending on indicators available and the history available at +runtime for a user. + +## Requirements + +1. Mahout 0.10.0 or later +2. Hadoop +3. Spark, the correct version for your version of Mahout and Hadoop +4. A search engine like Solr or Elasticsearch + +## Indicators + +Indicators come in 3 types + +1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions +2. **Content**: calculated from item metadata or content using *spark-rowsimilarity* +3. **Intrinsic**: assigned to items as metadata. Can be anything that describes the item. + +The query for recommendations will be a mix of values meant to match one of your indicators. The query can be constructed +from user history and values derived from context (category being viewed for instance) or special precalculated data +(popularity rank for instance). This blending of indicators allows for creating many flavors or recommendations to fit +a very wide variety of circumstances. + +With the right mix of indicators developers can construct a single query that works for completely new items and new users +while working well for items with lots of interactions and users with many recorded actions. In other words by adding in content and intrinsic +indicators developers can create a solution for the "cold-start" problem that gracefully improves with more user history +and as items have more interactions. It is also possible to create a completely content-based recommender that personalizes +recommendations. + +## Example with 3 Indicators + +You will need to decide how you store user action data so they can be processed by the item and row similarity jobs and +this is most easily done by using text files as described above. The data that is processed by these jobs is considered the +training data. You will need some amount of user history in your recs query. It is typical to use the most recent user history +but need not be exactly what is in the training set, which may include a greater volume of historical data. Keeping the user +history for query purposes could be done with a database by storing it in a users table. In the example above the two +collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other +descriptive metadata). + +We will need to create 1 cooccurrence indicator from the primary action (purchase) 1 cross-action cooccurrence indicator +from the secondary action (view) +and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once and *spark-rowsimilarity* once. + +We have described how to create the collaborative filtering indicator and cross-cooccurrence indicator for purchase and view (the [How to use Multiple User +Actions](#multiple-actions) section) but tags will be a slightly different process. We want to use the fact that +certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator +but rather a "content" or "metadata" type indicator since you are not using other users' history, only the +individual that you are making recs for. This means that this method will make recommendations for items that have +no collaborative filtering data, as happens with new items in a catalog. 
New items may have tags assigned but no one + has purchased or viewed them yet. In the final query we will mix all 3 indicators. + +## Content Indicator + +To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find +items with the most similar tags. Notice that other users' behavior is not considered--only other item's tags. This defines a +content or metadata indicator. They are used when you want to find items that are similar to other items by using their +content or metadata, not by which users interacted with them. + +For this we need input of the form: + + itemID<tab>list-of-tags + ... + +The full collection will look like the tags column from a catalog DB. For our ecom example it might be: + + 3459860b<tab>men long-sleeve chambray clothing casual + 9446577d<tab>women tops chambray clothing casual + ... + +We'll use *spark-rowimilairity* because we are looking for similar rows, which encode items in this case. As with the +collaborative filtering indicator and cross-cooccurrence indicator we use the --omitStrength option. The strengths created are +probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling +is finished we no longer need the strengths. We will get an indicator matrix of the form: + + itemID<tab>list-of-item IDs + ... + +This is a content indicator since it has found other items with similar content or metadata. + + 3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a + 9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e + ... + +We now have three indicators, two collaborative filtering type and one content type. + +## Unified Recommender Query + +The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. +For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query. + +We have 3 indicators, these are indexed by the search engine into 3 fields, we'll call them "purchase", "view", and "tags". +We take the user's history that corresponds to each indicator and create a query of the form: + + Query: + field: purchase; q:user's-purchase-history + field: view; q:user's view-history + field: tags; q:user's-tags-associated-with-purchases + +The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to +the ones the user has already purchased. + +This is only an example and not necessarily the optimal way to create recs. It illustrates how business decisions can be +translated into recommendations. This technique can be used to skew recommendations towards intrinsic indicators also. +For instance you may want to put personalized popular item recs in a special place in the UI. Create a popularity indicator +by tagging items with some category of popularity (hot, warm, cold for instance) then +index that as a new indicator field and include the corresponding value in a query +on the popularity field. If we use the ecom example but use the query to get "hot" recommendations it might look like this: + + Query: + field: purchase; q:user's-purchase-history + field: view; q:user's view-history + field: popularity; q:"hot" + +This will return recommendations favoring ones that have the intrinsic indicator "hot". + +## Notes +1. Use as much user action history as you can gather. 
Choose a primary action that is closest to what you want to recommend and the others will be used to create cross-cooccurrence indicators. Using more data in this fashion will almost always produce better recommendations.
+2. Content can be used where there is no recorded user behavior or when items change too quickly to get much interaction history. They can be used alone or mixed with other indicators.
+3. Most search engines support "boost" factors so you can favor one or more indicators. In the example query, if you want tags to only have a small effect you could boost the CF indicators.
+4. In the examples we have used space-delimited strings for lists of IDs in indicators and in queries. It may be better to use arrays of strings if your storage system and search engine support them. For instance Solr allows multi-valued fields, which correspond to arrays.
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/reccomenders/d-als.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/reccomenders/d-als.md b/website-old/docs/algorithms/reccomenders/d-als.md
new file mode 100644
index 0000000..b5fe1b0
--- /dev/null
+++ b/website-old/docs/algorithms/reccomenders/d-als.md
@@ -0,0 +1,58 @@
+---
+layout: algorithm
+title: Mahout Samsara Distributed ALS
+theme:
+  name: mahout2
+---
+Seems like someone has jacked up this page?
+TODO: Find the ALS Page
+
+## Intro
+
+Mahout has a distributed implementation of QR decomposition for tall, thin matrices [1].
+
+## Algorithm
+
+For the classic QR decomposition of the form `\(\mathbf{A}=\mathbf{QR},\mathbf{A}\in\mathbb{R}^{m\times n}\)` a distributed version is fairly easily achieved if `\(\mathbf{A}\)` is tall and thin such that `\(\mathbf{A}^{\top}\mathbf{A}\)` fits in memory, i.e. *m* is large but *n* < ~5000. Under such circumstances, only `\(\mathbf{A}\)` and `\(\mathbf{Q}\)` are distributed matrices and `\(\mathbf{A^{\top}A}\)` and `\(\mathbf{R}\)` are in-core products. We just compute the in-core version of the Cholesky decomposition in the form of `\(\mathbf{LL}^{\top}= \mathbf{A}^{\top}\mathbf{A}\)`. After that we take `\(\mathbf{R}= \mathbf{L}^{\top}\)` and `\(\mathbf{Q}=\mathbf{A}\left(\mathbf{L}^{\top}\right)^{-1}\)`. The latter is easily achieved by multiplying each vertical block of `\(\mathbf{A}\)` by `\(\left(\mathbf{L}^{\top}\right)^{-1}\)`. (There is no actual matrix inversion happening).
+
+
+## Implementation
+
+Mahout `dqrThin(...)` is implemented in the Mahout `math-scala` algebraic optimizer, which translates Mahout's R-like linear algebra operators into a physical plan for both the Spark and H2O distributed engines.
+
+    def dqrThin[K: ClassTag](drmA: DrmLike[K], checkRankDeficiency: Boolean = true): (DrmLike[K], Matrix) = {
+      if (drmA.ncol > 5000)
+        log.warn("A is too fat. A'A must fit in memory and easily broadcasted.")
+      implicit val ctx = drmA.context
+      val AtA = (drmA.t %*% drmA).checkpoint()
+      val inCoreAtA = AtA.collect
+      val ch = chol(inCoreAtA)
+      val inCoreR = (ch.getL cloned) t
+      if (checkRankDeficiency && !ch.isPositiveDefinite)
+        throw new IllegalArgumentException("R is rank-deficient.")
+      val bcastAtA = drmBroadcast(inCoreAtA)
+      val Q = drmA.mapBlock() {
+        case (keys, block) => keys -> chol(bcastAtA).solveRight(block)
+      }
+      Q -> inCoreR
+    }
+
+
+## Usage
+
+The Scala `dqrThin(...)` method can easily be called in any Spark or H2O application built with the `math-scala` library and the corresponding `Spark` or `H2O` engine module as follows:
+
+    import org.apache.mahout.math._
+    import decompositions._
+    import drm._
+
+    val (drmQ, inCoreR) = dqrThin(drmA)
+
+
+## References
+
+[1]: [Mahout Scala and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)
+
+[2]: [Mahout Spark and Scala Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/reccomenders/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/reccomenders/index.md b/website-old/docs/algorithms/reccomenders/index.md
new file mode 100644
index 0000000..00d8ec4
--- /dev/null
+++ b/website-old/docs/algorithms/reccomenders/index.md
@@ -0,0 +1,33 @@
+---
+layout: algorithm
+title: Recommender Quickstart
+theme:
+  name: retro-mahout
+---
+
+
+# Recommender Overview
+
+Recommenders have changed over the years. Mahout contains a long list of them, which you can still use. But to get the best out of our more modern approach we'll need to think of the Recommender as a "model creation" component, supplied by Mahout's new spark-itemsimilarity job, and a "serving" component, supplied by a modern scalable search engine, like Solr.
+
+
+
+To integrate with your application you will collect user interactions, storing them in a DB and also in a form usable by Mahout. The simplest way to do this is to log user interactions to csv files (user-id, item-id). The DB should be set up to contain the last n user interactions, which will form part of the query for recommendations.
+
+Mahout's spark-itemsimilarity will create a table of (item-id, list-of-similar-items) in csv form. Think of this as an item collection with one field containing the item-ids of similar items. Index this with your search engine.
+
+When your application needs recommendations for a specific person, get the latest user history of interactions from the DB and query the indicator collection with this history. You will get back an ordered list of item-ids. These are your recommendations. You may wish to filter out any that the user has already seen, but that will depend on your use case.
+
+All ids for users and items are preserved as string tokens and so work as an external key in DBs or as doc ids for search engines; they also work as tokens for search queries.
+
+##References
+
+1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning)
+2. A slide deck, which talks about mixing actions or other indicators: [Creating a Multimodal Recommender with Mahout and a Search Engine](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+3.
Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
+and [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
+4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html) LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity.
+
+##Mahout Model Creation
+
+See the page describing [*spark-itemsimilarity*](http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) for more details.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/fittness-tests.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/regression/fittness-tests.md b/website-old/docs/algorithms/regression/fittness-tests.md
new file mode 100644
index 0000000..1bd8984
--- /dev/null
+++ b/website-old/docs/algorithms/regression/fittness-tests.md
@@ -0,0 +1,17 @@
+---
+layout: algorithm
+title: Regression Fitness Tests
+theme:
+  name: mahout2
+---
+
+TODO: Fill this out!
+Stub
+
+### About
+
+### Parameters
+
+### Example
+
+
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/regression/index.md b/website-old/docs/algorithms/regression/index.md
new file mode 100644
index 0000000..3639c4f
--- /dev/null
+++ b/website-old/docs/algorithms/regression/index.md
@@ -0,0 +1,21 @@
+---
+layout: algorithm
+
+title: Regression Algorithms
+theme:
+  name: retro-mahout
+---
+
+Apache Mahout implements the following regression algorithms "off the shelf".
+
+### Closed Form Solutions
+
+These methods use closed-form solutions (not stochastic) to solve regression problems:
+
+[Ordinary Least Squares](ols.html)
+
+### Autocorrelation Regression
+
+Serial correlation of the error terms can lead to biased estimates of regression parameters; the following remedial procedures are provided:
+
+[Cochrane Orcutt Procedure](cochrane-orcutt.html)
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/ols.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/regression/ols.md b/website-old/docs/algorithms/regression/ols.md
new file mode 100644
index 0000000..58c74c5
--- /dev/null
+++ b/website-old/docs/algorithms/regression/ols.md
@@ -0,0 +1,66 @@
+---
+layout: algorithm
+title: Ordinary Least Squares Regression
+theme:
+  name: mahout2
+---
+
+### About
+
+The `OrdinaryLeastSquares` regressor in Mahout implements a _closed-form_ solution to [Ordinary Least Squares](https://en.wikipedia.org/wiki/Ordinary_least_squares).
+This is in stark contrast to many "big data machine learning" frameworks which implement a _stochastic_ approach. From the user's perspective this difference can be reduced to:
+
+- **_Stochastic_** - A series of guesses at a line of best fit.
+- **_Closed Form_** - A mathematical approach that has been explored in depth: the properties of the parameter estimates are well understood, and the problems which arise (and their remedial measures) are known. This is usually the preferred choice of mathematicians/statisticians, but computational limitations have forced many implementations to resort to SGD. (The normal equations behind the closed-form approach are sketched below.)
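+
+As a sketch of what the closed-form approach computes (standard least-squares algebra rather than an excerpt from the Mahout code), with an intercept column appended to \(\mathbf{X}\) the estimates are obtained directly from the normal equations
+
+<center>\(\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}\)</center>
+
+rather than being approached iteratively as in SGD.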
+
+### Parameters
+
+<div class="table-striped">
+  <table class="table">
+    <tr>
+        <th>Parameter</th>
+        <th>Description</th>
+        <th>Default Value</th>
+    </tr>
+    <tr>
+        <td><code>'calcCommonStatistics</code></td>
+        <td>Calculate common statistics such as the Coefficient of Determination and the Mean Square Error</td>
+        <td><code>true</code></td>
+    </tr>
+    <tr>
+        <td><code>'calcStandardErrors</code></td>
+        <td>Calculate the standard errors (and subsequent "t-scores" and "p-values") of the \(\boldsymbol{\beta}\) estimates</td>
+        <td><code>true</code></td>
+    </tr>
+    <tr>
+        <td><code>'addIntercept</code></td>
+        <td>Add an intercept to \(\mathbf{X}\)</td>
+        <td><code>true</code></td>
+    </tr>
+  </table>
+</div>
+
+### Example
+
+In this example we disable the "calculate common statistics" parameter, so our summary will NOT contain the coefficient of determination (R-squared) or the Mean Square Error.
+```scala
+import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
+
+val drmData = drmParallelize(dense(
+  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
+  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
+  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
+  (2, 1, 11,   13, 32.207582),  // Froot Loops
+  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
+  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
+  (6, 2, 17,   1,  50.764999),  // Cheerios
+  (3, 2, 13,   7,  40.400208),  // Clusters
+  (3, 3, 13,   4,  45.811716)), numPartitions = 2)
+
+
+val drmX = drmData(::, 0 until 4)
+val drmY = drmData(::, 4 until 5)
+
+val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY, 'calcCommonStatistics -> false)
+println(model.summary)
+```
http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md b/website-old/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md
new file mode 100644
index 0000000..b155ca8
--- /dev/null
+++ b/website-old/docs/algorithms/regression/serial-correlation/cochrane-orcutt.md
@@ -0,0 +1,146 @@
+---
+layout: algorithm
+title: Cochrane-Orcutt Procedure
+theme:
+  name: mahout2
+---
+
+### About
+
+The [Cochrane Orcutt](https://en.wikipedia.org/wiki/Cochrane%E2%80%93Orcutt_estimation) procedure is used in economics to
+adjust a linear model for serial correlation in the error term.
+
+The corresponding method in R is [`cochrane.orcutt`](https://cran.r-project.org/web/packages/orcutt/orcutt.pdf);
+however, the implementation differs slightly.
+
+#### R Prototype:
+    library(orcutt)
+
+    df = data.frame(t(data.frame(
+        c(20.96, 127.3),
+        c(21.40, 130.0),
+        c(21.96, 132.7),
+        c(21.52, 129.4),
+        c(22.39, 135.0),
+        c(22.76, 137.1),
+        c(23.48, 141.2),
+        c(23.66, 142.8),
+        c(24.10, 145.5),
+        c(24.01, 145.3),
+        c(24.54, 148.3),
+        c(24.30, 146.4),
+        c(25.00, 150.2),
+        c(25.64, 153.1),
+        c(26.36, 157.3),
+        c(26.98, 160.7),
+        c(27.52, 164.2),
+        c(27.78, 165.6),
+        c(28.24, 168.7),
+        c(28.78, 171.7))))
+
+    rownames(df) <- NULL
+    colnames(df) <- c("y", "x")
+    my_lm = lm(y ~ x, data=df)
+    coch = cochrane.orcutt(my_lm)
+
+
+The R implementation is kind of...silly.
+
+The above works- it converges at 318 iterations- the transformed DW is 1.72, yet the rho is
+.95882. After 318 iterations, this will also report a rho of .95882 (which suggests SEVERE
+autocorrelation, nothing close to 1.72).
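+Plugging that rho into the approximate relation \(D \approx 2(1-\rho)\) quoted in the notes below makes the point concrete (a rough sanity check, not output from R):
+
+<center>\(D \approx 2(1 - 0.95882) \approx 0.08\)</center>
+
+which is nowhere near the reported transformed DW of 1.72.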
+ + At any rate, the real prototype for this is the example from [Applied Linear Statistical Models, + 5th Edition, by Kutner, Nachtsheim, Neter, and Li](https://www.amazon.com/Applied-Linear-Statistical-Models-Hardcover/dp/B010EWX85C/ref=sr_1_4?ie=UTF8&qid=1493847480&sr=8-4&keywords=applied+linear+statistical+models+5th+edition). + +Steps (an illustrative sketch of one iteration follows the notes below): +1. Fit a normal regression. +2. Estimate <foo>\(\rho\)</foo> from the residuals. +3. Fit the transformed equation to get new estimates of the betas. +4. Use the betas from (3) to recalculate the model from (1). +5. Repeat steps 2 through 4 until a stopping criterion is met. Some implementations call for convergence; +Kutner et al. recommend 3 iterations, and if you don't achieve the desired results, use an alternative method. + +#### Some additional notes from Applied Linear Statistical Models: + They also provide some interesting notes on p. 494: + + 1. "Cochrane-Orcutt does not always work properly. A major reason is that when the error terms + are positively autocorrelated, the estimate <foo>\(r\)</foo> in (12.22) tends to underestimate the autocorrelation + parameter <foo>\(\rho\)</foo>. When this bias is serious, it can significantly reduce the effectiveness of the + Cochrane-Orcutt approach." + 1. "There exists an approximate relation between the [Durbin Watson test statistic](dw-test.html) <foo>\(\mathbf{D}\)</foo> in (12.14) + and the estimated autocorrelation parameter <foo>\(r\)</foo> in (12.22): + <center>\(D \approx 2(1 - r)\)</center>" + + They also note on p. 492: + "... If the process does not terminate after one or two iterations, a different procedure + should be employed." + This differs from the logic found elsewhere, and from the method presented in R, where, in the simple + example in the prototype, the procedure runs for 318 iterations. This is why the default + maximum number of iterations is 3, and it should be left as such. + + Also, the prototype and 'correct answers' are based on the example presented in Kutner et al. on + pp. 492-494 (including the dataset).
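To make steps 2 through 4 concrete, here is a minimal, illustrative sketch of a single Cochrane-Orcutt iteration written in plain Scala on arrays, deliberately independent of the Mahout API. The function name and the assumption that the step-1 residuals are already available are purely for illustration; for real work use the `CochraneOrcutt` fitter shown in the example below.

```scala
// Illustration only: one Cochrane-Orcutt iteration on plain Scala arrays.
// y and x are the original series; e holds the residuals of the step-1 regression.
def cochraneOrcuttStep(y: Array[Double], x: Array[Double], e: Array[Double])
    : (Double, Array[Double], Array[Double]) = {
  val n = y.length
  // Step 2: estimate rho from the lagged residuals,
  //   r = sum_{t=2..n} e_t * e_{t-1}  /  sum_{t=2..n} e_{t-1}^2
  val rho = (1 until n).map(t => e(t) * e(t - 1)).sum /
            (1 until n).map(t => e(t - 1) * e(t - 1)).sum
  // Steps 3-4: build the transformed series
  //   y'_t = y_t - rho * y_{t-1},   x'_t = x_t - rho * x_{t-1}
  // and refit ordinary least squares on (x', y') to obtain the new betas.
  val yPrime = (1 until n).map(t => y(t) - rho * y(t - 1)).toArray
  val xPrime = (1 until n).map(t => x(t) - rho * x(t - 1)).toArray
  (rho, xPrime, yPrime)
}
```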
+ + +### Parameters + + +<div class="table-striped"> + <table class="table"> + <tr> + <th>Parameter</th> + <th>Description</th> + <th>Default Value</th> + </tr> + <tr> + <td><code>'regressor</code></td> + <td>Any subclass of <code>org.apache.mahout.math.algorithms.regression.LinearRegressorFitter</code></td> + <td><code>OrdinaryLeastSquares()</code></td> + </tr> + <tr> + <td><code>'iterations</code></td> + <td>The maximum number of iterations. Unlike the R implementation, we stick to the 3-iteration guidance of Kutner et al.</td> + <td>3</td> + </tr> + <tr> + <td><code>'cacheHint</code></td> + <td>The DRM cache hint to use when holding the data in memory between iterations</td> + <td><code>CacheHint.MEMORY_ONLY</code></td> + </tr> + </table> +</div> + +### Example + +
    import org.apache.mahout.math.algorithms.regression.CochraneOrcutt

    val alsmBlaisdellCo = drmParallelize( dense(
      (20.96, 127.3),
      (21.40, 130.0),
      (21.96, 132.7),
      (21.52, 129.4),
      (22.39, 135.0),
      (22.76, 137.1),
      (23.48, 141.2),
      (23.66, 142.8),
      (24.10, 145.5),
      (24.01, 145.3),
      (24.54, 148.3),
      (24.30, 146.4),
      (25.00, 150.2),
      (25.64, 153.1),
      (26.36, 157.3),
      (26.98, 160.7),
      (27.52, 164.2),
      (27.78, 165.6),
      (28.24, 168.7),
      (28.78, 171.7) ))

    val drmY = alsmBlaisdellCo(::, 0 until 1)
    val drmX = alsmBlaisdellCo(::, 1 until 2)

    var coModel = new CochraneOrcutt[Int]().fit(drmX, drmY, ('iterations -> 2))

    println(coModel.rhos)
    println(coModel.summary)
 http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/regression/serial-correlation/dw-test.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/regression/serial-correlation/dw-test.md b/website-old/docs/algorithms/regression/serial-correlation/dw-test.md new file mode 100644 index 0000000..7bdd896 --- /dev/null +++ b/website-old/docs/algorithms/regression/serial-correlation/dw-test.md @@ -0,0 +1,43 @@ +--- +layout: algorithm +title: Durbin-Watson Test +theme: + name: mahout2 +--- + +### About + +The [Durbin Watson Test](https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic) is a test for serial correlation +in the error terms. The Durbin Watson test statistic <foo>\(d\)</foo> can take values between 0 and 4, and in general + +- <foo>\(d \lt 1.5 \)</foo> implies positive autocorrelation +- <foo>\(d \gt 2.5 \)</foo> implies negative autocorrelation +- <foo>\(1.5 \lt d \lt 2.5 \)</foo> implies no autocorrelation. + +The implementation is based on the `durbinWatsonTest` function in the [`car`](https://cran.r-project.org/web/packages/car/index.html) package in R. + +### Parameters + +### Example + +#### R Prototype + + library(car) + residuals <- seq(0, 4.9, 0.1) + ## perform Durbin-Watson test + durbinWatsonTest(residuals) + +#### In Apache Mahout + + + // A Durbin-Watson test must be performed on a model; which model does not matter.
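    // For reference, the statistic is computed from the residuals e_t of that model as
    //   d = sum_{t=2..n} (e_t - e_{t-1})^2  /  sum_{t=1..n} e_t^2
    // (this is the general definition; values of d near 2 indicate no serial correlation).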
    import org.apache.mahout.math.algorithms.regression.OrdinaryLeastSquares
    import org.apache.mahout.math.algorithms.regression.tests.AutocorrelationTests

    val err1 = drmParallelize( dense((0.0 until 5.0 by 0.1).toArray) ).t
    val drmX = drmParallelize( dense((0 until 50).toArray.map( t => Math.pow(-1.0, t)) ) ).t
    val drmY = drmX + err1 + 1
    var model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY)
    // end arbitrary model

    val syntheticResiduals = err1
    model = AutocorrelationTests.DurbinWatson(model, syntheticResiduals)
    val myAnswer: Double = model.testResults.getOrElse('durbinWatsonTestStatistic, -1.0).asInstanceOf[Double]
 http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/algorithms/template.md ---------------------------------------------------------------------- diff --git a/website-old/docs/algorithms/template.md b/website-old/docs/algorithms/template.md new file mode 100644 index 0000000..4a48829 --- /dev/null +++ b/website-old/docs/algorithms/template.md @@ -0,0 +1,20 @@ +--- +layout: algorithm +title: AsFactor +theme: + name: mahout2 +--- + +TODO: Fill this out! +Stub + +### About + +### Parameters + +### Example + + + + + http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/assets ---------------------------------------------------------------------- diff --git a/website-old/docs/assets b/website-old/docs/assets new file mode 120000 index 0000000..ec2e4be --- /dev/null +++ b/website-old/docs/assets @@ -0,0 +1 @@ +../assets \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/changelog.md ---------------------------------------------------------------------- diff --git a/website-old/docs/changelog.md b/website-old/docs/changelog.md new file mode 100755 index 0000000..7965e9d --- /dev/null +++ b/website-old/docs/changelog.md @@ -0,0 +1,70 @@ +## Changelog + +Public releases are all root nodes. +Incremental version bumps that were not released publicly are nested where appropriate. + +P.S. If there is a standard (popular) changelog format, please let me know. + +- **0.3.0 : 2013.02.24** + - **Features** + - Update twitter bootstrap to 2.2.2. Add responsiveness and update design a bit. + - @techotaku fixes custom tagline support (finally made it in!) + - @opie4624 adds ability to set tags from the command-line. + - @lax adds support for RSS feed. Adds rss and atom html links for discovery. + - Small typo fixes. + + - **Bug Fixes** + - @xuhdev fixes theme:install bug which does not overwrite theme even if saying 'yes'. + +- **0.2.13 : 2012.03.24** + - **Features** + - 0.2.13 : @mjpieters Updates pages_list helper to only show pages having a title. + - 0.2.12 : @sway recommends showing page tagline only if tagline is set. + - 0.2.11 : @LukasKnuth adds 'description' meta-data field to post/page scaffold. + + - **Bug Fixes** + - 0.2.10 : @koriroys fixes typo in atom feed + +- **0.2.9 : 2012.03.01** + - **Bug Fixes** + - 0.2.9 : @alishutc Fixes the error on post creation if date was not specified. + +- **0.2.8 : 2012.03.01** + - **Features** + - 0.2.8 : @metalelf0 Added option to specify a custom date when creating post. + - 0.2.7 : @daz Updates twitter theme framework to use 2.x while still maintaining core layout. #50 + @philips and @treggats add support for page.tagline metadata. #31 & #48 + - 0.2.6 : @koomar Adds Mixpanel analytics provider. #49 + - 0.2.5 : @nolith Adds ability to load custom rake scripts. #33 + - 0.2.4 : @tommyblue Updated disqus comments provider to be compatible with posts imported from Wordpress.
#47 + + - **Bug Fixes** + - 0.2.3 : @3martini Adds Windows MSYS Support and error checks for git system calls. #40 + - 0.2.2 : @sstar Resolved an issue preventing disabling comments for individual pages #44 + - 0.2.1 : Resolve incorrect HOME\_PATH/BASE\_PATH settings + +- **0.2.0 : 2012.02.01** + Features + - Add Theme Packages v 0.1.0 + All themes should be tracked and maintained outside of JB core. + Themes get "installed" via the Theme Installer. + Theme Packages versioning is done separately from JB core with + the main intent being to make sure theme versions are compatible with the given installer. + + - 0.1.2 : @jamesFleeting adds facebook comments support + - 0.1.1 : @SegFaultAX adds tagline as site-wide configuration + +- **0.1.0 : 2012.01.24** + First major versioned release. + Features + - Standardize Public API + - Use name-spacing and modulation where possible. + - Ability to override public methods with custom code. + - Publish the theme API. + - Ship with comments, analytics integration. + +- **0.0.1 : 2011.12.30** + First public release, lots of updates =p + Thank you everybody for dealing with the fast changes and helping + me work out the API to a manageable state. + http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/distributed/flink-bindings.md ---------------------------------------------------------------------- diff --git a/website-old/docs/distributed/flink-bindings.md b/website-old/docs/distributed/flink-bindings.md new file mode 100644 index 0000000..4a39e56 --- /dev/null +++ b/website-old/docs/distributed/flink-bindings.md @@ -0,0 +1,49 @@ +--- +layout: page +title: Mahout Samsara Flink Bindings +theme: + name: mahout2 +--- +# Introduction + +This document provides an overview of how the Mahout Samsara environment is implemented over the Apache Flink backend engine, and of the code layout for that engine; the source code can be found under the /flink directory in the Mahout codebase. + +Apache Flink is a distributed big data streaming engine that supports both Streaming and Batch interfaces. Batch processing is an extension of Flink's Stream processing engine. + +The Mahout Flink integration presently supports Flink's batch processing capabilities, leveraging the DataSet API. + +The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. An example is the Spark backend engine. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. + +# Flink Overview + +Apache Flink is an open source, distributed Stream and Batch Processing Framework. At its core, Flink is a Stream Processing engine, and Batch processing is an extension of Stream Processing.
+ +Flink includes several APIs for building applications with the Flink Engine: + + <ol> +<li><b>DataSet API</b> for Batch data in Java, Scala and Python</li> +<li><b>DataStream API</b> for Stream Processing in Java and Scala</li> +<li><b>Table API</b> with a SQL-like expression language in Java and Scala</li> +<li><b>Gelly</b> Graph Processing API in Java and Scala</li> +<li><b>CEP API</b>, a complex event processing library</li> +<li><b>FlinkML</b>, a Machine Learning library</li> +</ol> + +# Flink Environment Engine + +The Flink backend implements the abstract DRM as a Flink DataSet. A Flink job runs in the context of an ExecutionEnvironment (from the Flink Batch processing API). + +# Source Layout + +Within mahout.git, the top-level directory flink/ holds all the source code for the Flink backend engine. Sections of code that interface with the rest of the Mahout components are in Scala, and sections of the code that interface with the Flink DataSet API and implement algebraic operators are in Java. Here is a brief overview of what functionality can be found within the flink/ folder. + +flink/ - top level directory containing all Flink related code + +flink/src/main/scala/org/apache/mahout/flinkbindings/blas/*.scala - Physical operator code for the Samsara DSL algebra + +flink/src/main/scala/org/apache/mahout/flinkbindings/drm/*.scala - Flink DataSet DRM and broadcast implementation + +flink/src/main/scala/org/apache/mahout/flinkbindings/io/*.scala - Read / Write between DRMDataSet and files on HDFS + +flink/src/main/scala/org/apache/mahout/flinkbindings/FlinkEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine. + + http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/distributed/h2o-internals.md ---------------------------------------------------------------------- diff --git a/website-old/docs/distributed/h2o-internals.md b/website-old/docs/distributed/h2o-internals.md new file mode 100644 index 0000000..2388354 --- /dev/null +++ b/website-old/docs/distributed/h2o-internals.md @@ -0,0 +1,50 @@ +--- +layout: page +title: Mahout Samsara H2O Bindings +theme: + name: mahout2 +--- +# Introduction + +This document provides an overview of how the Mahout Samsara environment is implemented over the H2O backend engine. The document is aimed at Mahout developers, to give a high level description of the design so that one can explore the code inside `h2o/` with some context. + +## H2O Overview + +H2O is a distributed, scalable machine learning system. The internal architecture of H2O has a distributed math engine (h2o-core) and a separate layer on top for algorithms and UI. The Mahout integration requires only the math engine (h2o-core). + +## H2O Data Model + +The data model of the H2O math engine is a distributed columnar store (of primarily numbers, but also strings). A column of numbers is called a Vector, which is broken into Chunks (of a few thousand elements). Chunks are distributed across the cluster based on a deterministic hash. Therefore, any member of the cluster knows where a particular Chunk of a Vector is homed. Each Chunk is separately compressed in memory and elements are individually decompressed on the fly upon access with purely register operations (thereby achieving high memory throughput). An ordered set of similarly partitioned Vectors is composed into a Frame. A Frame is therefore a large two dimensional table of numbers.
All elements of a logical row in the Frame are guaranteed to be homed in the same server of the cluster. Generally speaking, H2O works well on "tall skinny" data, i.e., lots of rows (100s of millions) and a modest number of columns (10s of thousands). + + +## Mahout DRM + +The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. Examples are the Spark and H2O backend engines. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. + + +## H2O Environment Engine + +The H2O backend implements the abstract DRM as an H2O Frame. Each logical column in the DRM is an H2O Vector. All elements of a logical DRM row are guaranteed to be homed on the same server. A set of rows stored on a server is presented as a read-only virtual in-core Matrix (i.e., a BlockMatrix) to the closure method in the `mapBlock(...)` API. + +H2O provides a flexible execution framework called `MRTask`. The `MRTask` framework typically executes over a Frame (or even a Vector), supports various types of map() methods, can optionally modify the Frame or Vector (though this never happens in the Mahout integration), and can optionally create a new Vector or set of Vectors (to combine them into a new Frame, and consequently a new DRM). + + +## Source Layout + +Within mahout.git, the top-level directory `h2o/` holds all the source code related to the H2O backend engine. Part of the code (that interfaces with the rest of the Mahout components) is in Scala, and part of the code (that interfaces with h2o-core and implements algebraic operators) is in Java. Here is a brief overview of what functionality can be found where within `h2o/`. + + h2o/ - top level directory containing all H2O related code + + h2o/src/main/java/org/apache/mahout/h2obindings/ops/*.java - Physical operator code for the various DSL algebra + + h2o/src/main/java/org/apache/mahout/h2obindings/drm/*.java - DRM backing (onto Frame) and Broadcast implementation + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OHdfs.java - Read / Write between DRM (Frame) and files on HDFS + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OBlockMatrix.java - A vertical block matrix of the DRM presented as a virtual copy-on-write in-core Matrix. Used in the mapBlock() API + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java - A collection of various functionality and helpers, e.g., converting between in-core Matrix and DRM, and various summary statistics on DRM/Frame.
+ + h2o/src/main/scala/org/apache/mahout/h2obindings/H2OEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine + + h2o/src/main/scala/org/apache/mahout/h2obindings/* - Various abstract API implementations ("glue work") \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx ---------------------------------------------------------------------- diff --git a/website-old/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx b/website-old/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx new file mode 100644 index 0000000..ec1de04 Binary files /dev/null and b/website-old/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx differ
