[21/62] [abbrv] mahout git commit: WEBSITE Ported MR-Clustering Tutorials and Algos

rawkintrevo Thu, 04 May 2017 18:41:32 -0700

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/canopy-commandline.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/canopy-commandline.md b/website/docs/tutorials/map-reduce/clustering/canopy-commandline.md
new file mode 100644
index 0000000..cbaa81d
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/canopy-commandline.md
@@ -0,0 +1,70 @@
+---
+layout: mr_tutorial
+title: canopy-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a>
+# Running Canopy Clustering from the Command Line
+Mahout's Canopy clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run Canopy on that cluster. If either of the environment variables is
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout canopy <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.
+
+
+<a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout canopy -i testdata -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
+
+
+<a name="canopy-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout canopy -i testdata -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="canopy-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                           Path to job input directory.
+                                                   Must be a SequenceFile of
+                                                   VectorWritable
+      --output (-o) output                         The directory pathname for output.
+      --overwrite (-ow)                            If present, overwrite the output
+                                                   directory before running job
+      --distanceMeasure (-dm) distanceMeasure      The classname of the
+                                                   DistanceMeasure. Default is
+                                                   SquaredEuclidean
+      --t1 (-t1) t1                                T1 threshold value
+      --t2 (-t2) t2                                T2 threshold value
+      --clustering (-cl)                           If present, run clustering after
+                                                   the iterations have taken place
+      --help (-h)                                  Print out help
+
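
Note on canopy-commandline.md: the page says the run mode depends on whether both $HADOOP_HOME and $HADOOP_CONF_DIR are set. The rule can be sketched as below. The function name `mahout_mode` is hypothetical and is not part of `bin/mahout`; the sketch only tests that the variables are non-empty, and the real launcher script may apply further checks.

```shell
# Hypothetical helper mirroring the rule stated in the tutorial:
# both variables set -> cluster mode, otherwise stand-alone mode.
# This is NOT the logic of bin/mahout itself, only an illustration.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "standalone"
  fi
}

# Report which mode the current environment would select.
echo "Mahout would run in $(mahout_mode) mode"
```

Running this before `./bin/mahout canopy <OPTIONS>` gives a quick indication of which configuration the tutorial expects to be used.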


http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md b/website/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
new file mode 100644
index 0000000..abd2d3e
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
@@ -0,0 +1,53 @@
+---
+layout: mr_tutorial
+title: Clustering of synthetic control data
+theme:
+   name: retro-mahout
+---
+
+# Clustering synthetic control data
+
+## Introduction
+
+This example will demonstrate clustering of time series data, specifically control charts. [Control charts](http://en.wikipedia.org/wiki/Control_chart) are tools used to determine whether a manufacturing or business process is in a state of statistical control. Such control charts are generated or simulated repeatedly at equal time intervals. A [simulated dataset](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html) is available in the UCI Machine Learning Repository.
+
+A time series of control charts needs to be clustered into close-knit groups. The data set we use is synthetic and is meant to resemble real-world information in an anonymized format. It contains six different classes: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift. In this example we will use Mahout to cluster the data into the corresponding class buckets.
+
+*For the sake of simplicity, we won't use a cluster in this example, but instead show you the commands to run the clustering examples locally with Hadoop*.
+
+## Setup
+
+We need to do some initial setup before we are able to run the example. 
+
+
+  1. Start out by downloading the dataset to be clustered from the UCI Machine Learning Repository: [http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data).
+
+  2. Download the [latest release of Mahout](/general/downloads.html).
+
+  3. Unpack the release binary and switch to the *mahout-distribution-0.x* folder.
+
+  4. Make sure that the *JAVA_HOME* environment variable points to your local Java installation.
+
+  5. Create a folder called *testdata* in the current directory and copy the dataset into this folder.
+
+
+## Clustering Examples
+
+Depending on the clustering algorithm you want to run, the following commands can be used:
+
+
+   * [Canopy Clustering](/users/clustering/canopy-clustering.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
+
+   * [k-Means Clustering](/users/clustering/k-means-clustering.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
+
+
+   * [Fuzzy k-Means Clustering](/users/clustering/fuzzy-k-means.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
+
+The clustering output will be produced in the *output* directory. The output data points are in vector format. In order to read/analyze the output, you can use the [clusterdump](/users/clustering/cluster-dumper.html) utility provided by Mahout.
+
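
Note on clustering-of-synthetic-control-data.md: setup step 5 can be scripted roughly as follows. This is a sketch that assumes the dataset has already been downloaded to a local file; the function name `prepare_testdata` is hypothetical and not something Mahout ships.

```shell
# Hypothetical helper: copy an already-downloaded dataset into ./testdata,
# the folder the syntheticcontrol example jobs read from per the tutorial.
prepare_testdata() {
  dataset="$1"
  if [ ! -f "$dataset" ]; then
    echo "dataset not found: $dataset" >&2
    return 1
  fi
  mkdir -p testdata          # create the folder if it does not exist yet
  cp "$dataset" testdata/    # keep the original file name
}
```

After `prepare_testdata synthetic_control.data` succeeds from inside the unpacked Mahout folder, one of the jobs listed in the tutorial, for example `bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job`, can be started.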

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md b/website/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
new file mode 100644
index 0000000..40ed55f
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
@@ -0,0 +1,11 @@
+---
+layout: mr_tutorial
+title: Clustering Seinfeld Episodes
+theme:
+   name: retro-mahout
+---
+
+Below is a short tutorial on how to cluster Seinfeld episode transcripts with
+Mahout.
+
+http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/clusteringyourdata.md b/website/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
new file mode 100644
index 0000000..782cbb5
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
@@ -0,0 +1,126 @@
+---
+layout: mr_tutorial
+title: ClusteringYourData
+theme:
+   name: retro-mahout
+---
+
+# Clustering your data
+
+After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own
+data. See also [Wikipedia on cluster analysis](http://en.wikipedia.org/wiki/Cluster_analysis) for more background.
+
+The following pieces *may* be useful in getting started:
+
+<a name="ClusteringYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format; see [Creating Vectors](../basics/creating-vectors.html).
+In particular, for text preparation check out [Creating Vectors from Text](../basics/creating-vectors-from-text.html).
+
+
+<a name="ClusteringYourData-RunningtheProcess"></a>
+# Running the Process
+
+* [Canopy background](canopy-clustering.html) and [canopy-commandline](canopy-commandline.html).
+
+* [K-Means background](k-means-clustering.html), [k-means-commandline](k-means-commandline.html), and
+[fuzzy-k-means-commandline](fuzzy-k-means-commandline.html).
+
+* [Dirichlet background](dirichlet-process-clustering.html) and [dirichlet-commandline](dirichlet-commandline.html).
+
+* [Meanshift background](mean-shift-clustering.html) and [mean-shift-commandline](mean-shift-commandline.html).
+
+* [LDA (Latent Dirichlet Allocation) background](-latent-dirichlet-allocation.html) and [lda-commandline](lda-commandline.html).
+
+* TODO: kmeans++/ streaming kMeans documentation
+
+
+<a name="ClusteringYourData-RetrievingtheOutput"></a>
+# Retrieving the Output
+
+Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.
+
+    ./bin/mahout clusterdump <OPTIONS>
+
+
+<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
+## The cluster dumper options are:
+
+      --help (-h)                                  Print out help
+
+      --input (-i) input                           The directory containing Sequence
+                                                   Files for the Clusters
+
+      --output (-o) output                         The output file.  If not specified,
+                                                   dumps to the console.
+
+      --outputFormat (-of) outputFormat            The optional output format to write
+                                                   the results as. Options: TEXT, CSV,
+                                                   or GRAPH_ML
+
+      --substring (-b) substring                   The number of chars of the
+                                                   asFormatString() to print
+
+      --pointsDir (-p) pointsDir                   The directory containing points
+                                                   sequence files mapping input vectors
+                                                   to their cluster.  If specified,
+                                                   then the program will output the
+                                                   points associated with a cluster
+
+      --dictionary (-d) dictionary                 The dictionary file.
+
+      --dictionaryType (-dt) dictionaryType        The dictionary file type
+                                                   (text|sequencefile)
+
+      --distanceMeasure (-dm) distanceMeasure      The classname of the DistanceMeasure.
+                                                   Default is SquaredEuclidean.
+
+      --numWords (-n) numWords                     The number of top terms to print
+
+      --tempDir tempDir                            Intermediate output directory
+
+      --startPhase startPhase                      First phase to run
+
+      --endPhase endPhase                          Last phase to run
+
+      --evaluate (-e)                              Run ClusterEvaluator and CDbwEvaluator
+                                                   over the input. The output will be
+                                                   appended to the rest of the output
+                                                   at the end.
+
+
+More information on using the clusterdump utility can be found [here](cluster-dumper.html).
+
+<a name="ClusteringYourData-ValidatingtheOutput"></a>
+# Validating the Output
+
+> Ted Dunning: A principled approach to cluster evaluation is to measure how well the
+> cluster membership captures the structure of unseen data.  A natural
+> measure for this is to measure how much of the entropy of the data is
+> captured by cluster membership.  For k-means and its natural L_2 metric,
+> the natural cluster quality metric is the squared distance from the nearest
+> centroid adjusted by the log_2 of the number of clusters.  This can be
+> compared to the squared magnitude of the original data or the squared
+> deviation from the centroid for all of the data.  The idea is that you are
+> changing the representation of the data by allocating some of the bits in
+> your original representation to represent which cluster each point is in.
+> If those bits aren't made up by the residue being small then your
+> clustering is making a bad trade-off.
+>
+> In the past, I have used other more heuristic measures as well.  One of the
+> key characteristics that I would like to see out of a clustering is a
+> degree of stability.  Thus, I look at the fractions of points that are
+> assigned to each cluster or the distribution of distances from the cluster
+> centroid. These values should be relatively stable when applied to held-out
+> data.
+>
+> For text, you can actually compute perplexity which measures how well
+> cluster membership predicts what words are used.  This is nice because you
+> don't have to worry about the entropy of real valued numbers.
+>
+> Manual inspection and the so-called laugh test is also important.  The idea
+> is that the results should not be so ludicrous as to make you laugh.
+> Unfortunately, it is pretty easy to kid yourself into thinking your system
+> is working using this kind of inspection.  The problem is that we are too
+> good at seeing (making up) patterns.
+
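
Note on clusteringyourdata.md: the page gives `./bin/mahout clusterdump <OPTIONS>` without a filled-in example. The sketch below assembles one from the `-i`, `-p`, `-o` and `-of` options in the table above. The function name `clusterdump_cmd` and the directory names in the usage line are placeholders, not paths that Mahout guarantees; the function only prints the command so it can be inspected before running.

```shell
# Hypothetical helper: print a clusterdump command line built from the
# documented options. Arguments are, in order:
#   $1 directory with the cluster SequenceFiles   (-i)
#   $2 directory with the clustered points        (-p)
#   $3 output file for the text dump              (-o)
clusterdump_cmd() {
  clusters_dir="$1"
  points_dir="$2"
  out_file="$3"
  echo ./bin/mahout clusterdump -i "$clusters_dir" -p "$points_dir" -o "$out_file" -of TEXT
}

# Example with placeholder paths; substitute the directories your job wrote.
clusterdump_cmd output/clusters-final output/clusteredPoints dump.txt
```

Printing the command first is a deliberate choice here: the correct input directory depends on which algorithm and how many iterations produced the output, so it is worth checking the paths before dumping.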

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md b/website/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
new file mode 100644
index 0000000..96d9b23
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
@@ -0,0 +1,97 @@
+---
+layout: mr_tutorial
+title: fuzzy-k-means-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="fuzzy-k-means-commandline-RunningFuzzyk-MeansClusteringfromtheCommandLine"></a>
+# Running Fuzzy k-Means Clustering from the Command Line
+Mahout's Fuzzy k-Means clustering can be launched from the same command
+line invocation whether you are running on a single machine in stand-alone
+mode or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run Fuzzy k-Means on that cluster. If either of the environment variables is
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout fkmeans <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.
+
+
+<a name="fuzzy-k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+
+<a name="fuzzy-k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="fuzzy-k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                           Path to job input directory.
+                                                   Must be a SequenceFile of
+                                                   VectorWritable
+      --clusters (-c) clusters                     The input centroids, as Vectors.
+                                                   Must be a SequenceFile of
+                                                   Writable, Cluster/Canopy. If k
+                                                   is also specified, then a random
+                                                   set of vectors will be selected
+                                                   and written out to this path
+                                                   first
+      --output (-o) output                         The directory pathname for
+                                                   output.
+      --distanceMeasure (-dm) distanceMeasure      The classname of the
+                                                   DistanceMeasure. Default is
+                                                   SquaredEuclidean
+      --convergenceDelta (-cd) convergenceDelta    The convergence delta value.
+                                                   Default is 0.5
+      --maxIter (-x) maxIter                       The maximum number of
+                                                   iterations.
+      --k (-k) k                                   The k in k-Means.  If specified,
+                                                   then a random selection of k
+                                                   Vectors will be chosen as the
+                                                   Centroid and written to the
+                                                   clusters input path.
+      --m (-m) m                                   coefficient normalization
+                                                   factor, must be greater than 1
+      --overwrite (-ow)                            If present, overwrite the output
+                                                   directory before running job
+      --help (-h)                                  Print out help
+      --numMap (-u) numMap                         The number of map tasks.
+                                                   Defaults to 10
+      --maxRed (-r) maxRed                         The number of reduce tasks.
+                                                   Defaults to 2
+      --emitMostLikely (-e) emitMostLikely         True if clustering should emit
+                                                   the most likely point only,
+                                                   false for threshold clustering.
+                                                   Default is true
+      --threshold (-t) threshold                   The pdf threshold used for
+                                                   cluster determination. Default
+                                                   is 0
+      --clustering (-cl)                           If present, run clustering after
+                                                   the iterations have taken place
+
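
Note on fuzzy-k-means-commandline.md: the run examples leave `<OPTIONS>` open, and the table states that `-m` must be greater than 1. The sketch below fills in the options from the table and rejects an invalid `m` before printing the command. The function name `fkmeans_cmd` is hypothetical, and the values for `-x` and `-k` are illustrative assumptions rather than recommended settings.

```shell
# Hypothetical helper: validate m (> 1, per the option table) and print a
# fully specified fkmeans command line. It does not run Mahout.
fkmeans_cmd() {
  m="$1"
  # awk handles the floating-point comparison; "m + 0" forces a numeric
  # reading so that non-numeric input is treated as 0 and rejected.
  if ! awk -v m="$m" 'BEGIN { exit !(m + 0 > 1) }'; then
    echo "m must be greater than 1, got: $m" >&2
    return 1
  fi
  echo ./bin/mahout fkmeans -i testdata -c clusters -o output \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -k 25 -m "$m" -ow -cl
}
```

For example, `fkmeans_cmd 2` prints a single-line command that can be pasted into a shell, while `fkmeans_cmd 1` fails with an error message on standard error.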

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/k-means-commandline.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/k-means-commandline.md b/website/docs/tutorials/map-reduce/clustering/k-means-commandline.md
new file mode 100644
index 0000000..9dff963
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/k-means-commandline.md
@@ -0,0 +1,94 @@
+---
+layout: mr_tutorial
+title: k-means-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="k-means-commandline-Introduction"></a>
+# kMeans commandline introduction
+
+This quick start page describes how to run the kMeans clustering algorithm
+on a Hadoop cluster. 
+
+<a name="k-means-commandline-Steps"></a>
+# Steps
+
+Mahout's k-Means clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run k-Means on that cluster. If either of the environment variables is
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout kmeans <OPTIONS>
+
+
+In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.
+
+
+<a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
+
+
+<a name="k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                           Path to job input directory.
+                                                   Must be a SequenceFile of
+                                                   VectorWritable
+      --clusters (-c) clusters                     The input centroids, as Vectors.
+                                                   Must be a SequenceFile of
+                                                   Writable, Cluster/Canopy. If k
+                                                   is also specified, then a random
+                                                   set of vectors will be selected
+                                                   and written out to this path
+                                                   first
+      --output (-o) output                         The directory pathname for
+                                                   output.
+      --distanceMeasure (-dm) distanceMeasure      The classname of the
+                                                   DistanceMeasure. Default is
+                                                   SquaredEuclidean
+      --convergenceDelta (-cd) convergenceDelta    The convergence delta value.
+                                                   Default is 0.5
+      --maxIter (-x) maxIter                       The maximum number of
+                                                   iterations.
+      --maxRed (-r) maxRed                         The number of reduce tasks.
+                                                   Defaults to 2
+      --k (-k) k                                   The k in k-Means.  If specified,
+                                                   then a random selection of k
+                                                   Vectors will be chosen as the
+                                                   Centroid and written to the
+                                                   clusters input path.
+      --overwrite (-ow)                            If present, overwrite the output
+                                                   directory before running job
+      --help (-h)                                  Print out help
+      --clustering (-cl)                           If present, run clustering after
+                                                   the iterations have taken place
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/lda-commandline.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/lda-commandline.md b/website/docs/tutorials/map-reduce/clustering/lda-commandline.md
new file mode 100644
index 0000000..e4d7014
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/lda-commandline.md
@@ -0,0 +1,83 @@
+---
+layout: mr_tutorial
+title: lda-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a>
+# Running Latent Dirichlet Allocation (algorithm) from the Command Line
+[Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897)
+LDA has been implemented as Collapsed Variational Bayes (cvb).
+
+Mahout's LDA can be launched from the same command line invocation whether
+you are running on a single machine in stand-alone mode or on a larger
+Hadoop cluster. The difference is determined by the $HADOOP_HOME and
+$HADOOP_CONF_DIR environment variables. If both are set to an operating
+Hadoop cluster on the target machine then the invocation will run the LDA
+algorithm on that cluster. If either of the environment variables is
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+
+    ./bin/mahout cvb <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job.
+
+
+<a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+
+<a name="lda-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a>
+# Command line options from Mahout cvb version 0.8
+
+    mahout cvb -h 
+      --input (-i) input                                      Path to job input directory.
+      --output (-o) output                                    The directory pathname for output.
+      --maxIter (-x) maxIter                                  The maximum number of iterations.
+      --convergenceDelta (-cd) convergenceDelta               The convergence delta value
+      --overwrite (-ow)                                       If present, overwrite the output directory before running job
+      --num_topics (-k) num_topics                            Number of topics to learn
+      --num_terms (-nt) num_terms                             Vocabulary size
+      --doc_topic_smoothing (-a) doc_topic_smoothing          Smoothing for document/topic distribution
+      --term_topic_smoothing (-e) term_topic_smoothing        Smoothing for topic/term distribution
+      --dictionary (-dict) dictionary                         Path to term-dictionary file(s) (glob expression supported)
+      --doc_topic_output (-dt) doc_topic_output               Output path for the training doc/topic distribution
+      --topic_model_temp_dir (-mt) topic_model_temp_dir       Path to intermediate model path (useful for restarting)
+      --iteration_block_size (-block) iteration_block_size    Number of iterations per perplexity check
+      --random_seed (-seed) random_seed                       Random seed
+      --test_set_fraction (-tf) test_set_fraction             Fraction of data to hold out for testing
+      --num_train_threads (-ntt) num_train_threads            number of threads per mapper to train with
+      --num_update_threads (-nut) num_update_threads          number of threads per mapper to update the model with
+      --max_doc_topic_iters (-mipd) max_doc_topic_iters       max number of iterations per doc for p(topic|doc) learning
+      --num_reduce_tasks num_reduce_tasks                     number of reducers to use during model estimation
+      --backfill_perplexity                                   enable backfilling of missing perplexity values
+      --help (-h)                                             Print out help
+      --tempDir tempDir                                       Intermediate output directory
+      --startPhase startPhase                                 First phase to run
+      --endPhase endPhase                                     Last phase to run
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/viewing-result.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/viewing-result.md b/website/docs/tutorials/map-reduce/clustering/viewing-result.md
new file mode 100644
index 0000000..e7c6107
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/viewing-result.md
@@ -0,0 +1,15 @@
+---
+layout: mr_tutorial
+title: Viewing Result
+theme:
+   name: retro-mahout
+---
+* [Algorithm Viewing pages](#ViewingResult-AlgorithmViewingpages)
+
+There are various technologies available to view the output of Mahout
+algorithms.
+* Clusters
+
+<a name="ViewingResult-AlgorithmViewingpages"></a>
+# Algorithm Viewing pages

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/viewing-results.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/clustering/viewing-results.md b/website/docs/tutorials/map-reduce/clustering/viewing-results.md
new file mode 100644
index 0000000..abbe7ef
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/clustering/viewing-results.md
@@ -0,0 +1,49 @@
+---
+layout: mr_tutorial
+title: Viewing Results
+theme:
+   name: retro-mahout
+---
+<a name="ViewingResults-Intro"></a>
+# Intro
+
+Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
+sequence files or other data structures.  This page is intended to
+demonstrate the various ways one might inspect the outcome of various jobs.
+ The page is organized by algorithms.
+
+<a name="ViewingResults-GeneralUtilities"></a>
+# General Utilities
+
+<a name="ViewingResults-SequenceFileDumper"></a>
+## Sequence File Dumper
+
+
+<a name="ViewingResults-Clustering"></a>
+# Clustering
+
+<a name="ViewingResults-ClusterDumper"></a>
+## Cluster Dumper
+
+Run the following to print out all options:
+
+    java  -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --help
+
+
+
+<a name="ViewingResults-Example"></a>
+### Example
+
+    java  -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --seqFileDir
+./solr-clust-n2/out/clusters-2
+          --dictionary ./solr-clust-n2/dictionary.txt
+          --substring 100 --pointsDir ./solr-clust-n2/out/points/
+    
+
+
+
+<a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a>
+## Cluster Labels (MAHOUT-163)
+
+<a name="ViewingResults-Classification"></a>
+# Classification

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
----------------------------------------------------------------------
diff --git 
a/website/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md 
b/website/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
new file mode 100644
index 0000000..9d6c827
--- /dev/null
+++ 
b/website/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
@@ -0,0 +1,50 @@
+---
+layout: mr_tutorial
+title: Visualizing Sample Clusters
+theme:
+   name: retro-mahout
+---
+
+<a name="VisualizingSampleClusters-Introduction"></a>
+# Introduction
+
+Mahout provides examples that visualize the sample clusters created by
+our clustering algorithms. Note that the visualization is done by Swing
+programs, so you have to be in a window system on the same machine you
+run these on, or logged in via a remote desktop.
+
+To visualize the clusters, you have to execute the Java
+classes under the *org.apache.mahout.clustering.display* package in the
+mahout-examples module. The easiest way to achieve this is to
+[set up Mahout](users/basics/quickstart.html) in your IDE.
+
+<a name="VisualizingSampleClusters-Visualizingclusters"></a>
+# Visualizing clusters
+
+The following classes in *org.apache.mahout.clustering.display* can be run
+without parameters to generate a sample data set and run the reference
+clustering implementations over it:
+
+1. **DisplayClustering** - generates 1000 samples from three symmetric
+distributions. This is the same data set that is used by the following
+clustering programs. It displays the points on a screen and superimposes
+the model parameters that were used to generate the points. You can edit
+the *generateSamples()* method to change the sample points used by these
+programs.
+1. **DisplayClustering** - displays initial areas of generated points
+1. **DisplayCanopy** - uses Canopy clustering
+1. **DisplayKMeans** - uses k-Means clustering
+1. **DisplayFuzzyKMeans** - uses Fuzzy k-Means clustering
+1. **DisplaySpectralKMeans** - uses Spectral KMeans via map-reduce algorithm
+
+If you are using Eclipse, just right-click on each of the classes mentioned
+above and choose "Run As > Java Application". To run these directly from the
+command line:
+
+    cd $MAHOUT_HOME/examples
+    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering
+
+You can substitute other names above for *DisplayClustering*. 
+
+
+Note that some of these programs display the sample points and then
+superimpose all of the clusters from each iteration. The last iteration's
+clusters are in bold red and the previous several are colored (orange,
+yellow, green, blue, magenta) in order, after which all earlier clusters
+are in light grey. This helps to visualize how the clusters converge upon
+a solution over multiple iterations.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/mr---map-reduce.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/misc/mr---map-reduce.md 
b/website/docs/tutorials/map-reduce/misc/mr---map-reduce.md
new file mode 100644
index 0000000..b03d6ad
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/misc/mr---map-reduce.md
@@ -0,0 +1,19 @@
+---
+layout: default
+title: MR - Map Reduce
+theme:
+   name: retro-mahout
+---
+
+MapReduce is a framework for processing huge datasets on certain
+kinds of distributable problems using a large number of computers (nodes),
+collectively referred to as a cluster. Computational processing
+can occur on data stored either in a filesystem (unstructured) or within a
+database (structured).
+
+Also written M/R.
+
+
+See also:
+
+* [http://wiki.apache.org/hadoop/HadoopMapReduce](http://wiki.apache.org/hadoop/HadoopMapReduce)
+* [http://en.wikipedia.org/wiki/MapReduce](http://en.wikipedia.org/wiki/MapReduce)

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git 
a/website/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md 
b/website/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
new file mode 100644
index 0000000..e2978a4
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
@@ -0,0 +1,185 @@
+---
+layout: default
+title: Parallel Frequent Pattern Mining
+theme:
+    name: retro-mahout
+---
+Mahout has a Top K Parallel FPGrowth implementation. It is based on the paper
+[http://infolab.stanford.edu/~echang/recsys08-69.pdf](http://infolab.stanford.edu/~echang/recsys08-69.pdf)
+with some optimisations in mining the data.
+
+Given a huge transaction list, the algorithm finds all unique features (sets
+of field values) and eliminates those features whose frequency in the whole
+dataset is less than minSupport. Using the remaining N features, we find
+the top K closed patterns for each of them, generating a total of NxK
+patterns. The FPGrowth algorithm is a generic implementation, so we can use
+any Object type to denote a feature. The current implementation requires you
+to use a String as the object type. You may implement a version for any
+object by creating Iterators, Convertors and TopKPatternWritable for that
+particular object. For more information, please refer to the package
+org.apache.mahout.fpm.pfpgrowth.convertors.string.
+
+For example:
+
+    FPGrowth<String> fp = new FPGrowth<String>();
+    Set<String> features = new HashSet<String>();
+    fp.generateTopKStringFrequentPatterns(
+        new StringRecordIterator(
+            new FileLineIterable(new File(input), encoding, false), pattern),
+        fp.generateFList(
+            new StringRecordIterator(
+                new FileLineIterable(new File(input), encoding, false), pattern),
+            minSupport),
+        minSupport,
+        maxHeapSize,
+        features,
+        new StringOutputConvertor(
+            new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer))
+    );
+
+* The first argument is the iterator over transactions, in this case an
+Iterator<List<String>>
+* The second argument is the output of the generateFList function, which
+returns the frequent items and their frequencies from the given database
+transaction iterator
+* The third argument is the minimum support of the patterns to be generated
+* The fourth argument is the maximum number of patterns to be mined for
+each feature
+* The fifth argument is the set of features for which the frequent patterns
+have to be mined
+* The last argument is an output collector which takes \[key, value\] pairs
+of Feature and TopK Patterns of the format \[String,
+List<Pair<List<String>, Long>>\] and writes them to the appropriate writer
+class, which takes care of storing the object, in this case in a Sequence
+File Output format
+
+<a 
name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a>
+## Running Frequent Pattern Growth via command line
+
+The command line launcher for string transaction data,
+org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features including
+specifying the regex pattern for splitting a string line of a transaction
+into its constituent features.
+
+Input files have to be in the following format.
+
+<optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE....
+
+Instead of a tab you could use , or \| as the default tokenization is done
+using the Java regex pattern `[ ,\t]*[,|\t][ ,\t]*`.
+You can override this parameter to parse your log files or transaction
+files (each line is a transaction). The FPGrowth algorithm mines the top K
+frequently occurring sets of items and their counts from the given input
+data.
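To get a feel for how such a splitter pattern tokenizes a line, here is a small Python sketch. It is only an illustration: Mahout applies the pattern with Java's regex engine, the default pattern shown is assumed to be `[ ,\t]*[,|\t][ ,\t]*`, and the sample lines and the `tokenize` helper are made up for this example.

```python
import re

# Assumed default splitter pattern; Mahout applies it in Java,
# this sketch only mimics the tokenization with Python's re module.
DEFAULT_PATTERN = re.compile(r"[ ,\t]*[,|\t][ ,\t]*")


def tokenize(line, pattern=DEFAULT_PATTERN):
    """Split one transaction line into its non-empty features."""
    return [token for token in pattern.split(line.strip()) if token]


# Hypothetical sample transactions, not taken from any Mahout dataset.
comma_line = "milk, bread | eggs"
space_line = "39 41 48"

# Commas, pipes and tabs separate features under the default pattern.
default_tokens = tokenize(comma_line)

# A space-only pattern, like the one passed via -regex on the command line.
space_tokens = tokenize(space_line, re.compile(r"[\ ]"))
```

Note that a line separated only by spaces stays a single token under the default pattern, which is why a space-separated file needs its own regex.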
+
+$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this
+format.
+Other sample files, such as accidents.dat.gz, are available from
+[http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/).
+As a quick test, try this:
+
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+The minimumSupport parameter \-s is the minimum number of times a pattern
+or a feature needs to occur in the dataset so that it is included in the
+patterns generated. You can speed up the process by using a large value of
+s. There are cases where you will have fewer than k patterns for a
+particular feature because the rest don't meet the minimum support
+criterion.
+
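The effect of the minimum support threshold can be sketched in a few lines of Python. This only illustrates the frequency filter applied to single features, not the FPGrowth pattern mining itself; the `frequent_features` helper and the toy transactions are assumptions made for this example.

```python
from collections import Counter


def frequent_features(transactions, min_support):
    """Count each feature once per transaction and keep those meeting min_support."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))
    return {feature: n for feature, n in counts.items() if n >= min_support}


# Hypothetical toy transactions.
transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["milk", "bread", "butter"],
]

# With a minimum support of 2, "eggs" and "butter" are dropped.
kept = frequent_features(transactions, min_support=2)
```

Raising the threshold shrinks the set of surviving features, which is where the speed-up comes from.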
+Note that the input to the algorithm can be an uncompressed or compressed
+gz file, or even a directory containing any number of such files.
+We modified the regex to use a space to split the tokens. Note that the input
+regex string is escaped.
+
+<a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a>
+## Running Parallel FPGrowth
+
+Running parallel FPGrowth is as easy as changing the flag to \-method
+mapreduce and adding the number of groups parameter, e.g. \-g 20 for 20
+groups. First, let's run the above sample test in map-reduce mode:
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in
+sequential mode (with 5 gigs of RAM allocated). In a separate test,
+the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30
+seconds in sequential mode.
+
+Here is another dataset which, while several times larger, requires much
+less time to find frequent patterns, as there are very few. Get
+accidents.dat.gz from
+[http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/)
+and place it on your HDFS in a folder named accidents. Then, run the
+Hadoop version of the FPGrowth job:
+
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+
+Or, to run a dataset of this size in sequential mode on a single machine,
+let's give Mahout a lot more memory and only keep features with more than
+300 members:
+
+    export MAHOUT_HEAPSIZE=5000
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+
+The numGroups parameter \-g in FPGrowthJob specifies the number of groups
+into which transactions have to be decomposed. The default of 1000 works
+very well on a single-machine cluster; this may be very different on large
+clusters.
+
+Note that accidents.dat has 340 unique features. So we chose \-g 10 to
+split the transactions across 10 shards where 34 patterns are mined from
+each shard. (Note: g doesn't need to divide the feature count exactly.) The
+algorithm takes care of calculating the split. For better performance on
+large datasets and clusters, try not to mine for more than 20-25 features
+per shard. Stick to the defaults on a small machine.
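The arithmetic behind the group count can be sketched as follows. This is a simplified model, assuming features are numbered from 0 and assigned to groups in consecutive blocks; the helper names are hypothetical and Mahout's actual split may differ in detail.

```python
import math


def features_per_group(num_features, num_groups):
    """Upper bound on the number of features handled by one group (shard)."""
    return math.ceil(num_features / num_groups)


def group_of(feature_index, num_features, num_groups):
    """Group a feature falls into when features are assigned in consecutive blocks."""
    return feature_index // features_per_group(num_features, num_groups)


# The accidents.dat example from the text: 340 features in 10 groups.
per_group = features_per_group(340, 10)

# The group count does not have to divide the feature count exactly.
uneven = features_per_group(340, 15)
```

To stay within the suggested 20-25 features per shard, a dataset with 340 features would need roughly 14 to 17 groups under this model.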
+
+The numTreeCacheEntries parameter \-tc specifies the number of generated
+conditional FP-Trees to be kept in memory so that subsequent operations do
+not need to regenerate them. Increasing this number increases the memory
+consumption but might improve speed up to a certain point. This depends
+entirely on the dataset in question. A value of 5-10 is recommended for
+mining up to the top 100 patterns for each feature.
+
+<a name="ParallelFrequentPatternMining-Viewingtheresults"></a>
+## Viewing the results
+The output will be dumped to a SequenceFile in the frequentpatterns
+directory in Text=>TopKStringPatterns format. Run this command to see a few
+of the Frequent Patterns:
+
+    bin/mahout seqdumper \
+         -i patterns/frequentpatterns/part-?-00000 \
+         -n 4
+
+or replace -n 4 with -c for the count of patterns.
+ 
+Open questions: how does one experiment with and monitor these various
+parameters?

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md 
b/website/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
new file mode 100644
index 0000000..308040c
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
@@ -0,0 +1,41 @@
+---
+layout: default
+title: Perceptron and Winnow
+theme:
+    name: retro-mahout
+---
+<a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a>
+# Classification with Perceptron or Winnow
+
+Both algorithms are comparatively simple linear classifiers. Given training
+data in some n-dimensional vector space that is annotated with binary
+labels, the algorithms are guaranteed to find a linear separating hyperplane
+if one exists. In contrast to the Perceptron, Winnow works only for binary
+feature vectors.
+
+For more information on the Perceptron see for instance:
+http://en.wikipedia.org/wiki/Perceptron
+
+Concise course notes on both algorithms:
+http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture24.pdf
+
+Although the algorithms are comparatively simple, they still work pretty well
+for text classification and are fast to train even for huge example sets.
+In contrast to Naive Bayes, they are not based on the assumption that all
+features (in the domain of text classification: all terms in a document)
+are independent.
+
+<a name="PerceptronandWinnow-Strategyforparallelisation"></a>
+## Strategy for parallelisation
+
+Currently the strategy for parallelisation is simple: Given there is enough
+training data, split the training data. Train the classifier on each split.
+The resulting hyperplanes are then averaged.
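The split, train and average strategy can be sketched in plain Python. This is not the code from the Mahout patch; it is a minimal serial illustration with a textbook perceptron, made-up well-separated 2-D data, and helper names chosen for this example.

```python
import random


def train_perceptron(samples, epochs=20):
    """Train a perceptron on (features, label) pairs with labels in {-1, +1}."""
    dim = len(samples[0][0])
    w = [0.0] * (dim + 1)  # the last entry is the bias
    for _ in range(epochs):
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if y * activation <= 0:  # misclassified: move the hyperplane
                for i, xi in enumerate(x):
                    w[i] += y * xi
                w[-1] += y
    return w


def average(models):
    """Average the weight vectors (hyperplanes) of several models."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + w[-1] > 0 else -1


def make_data(n, seed):
    """Two well-separated clusters around (2, 2) and (-2, -2)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        x = [2 * label + rng.uniform(-1, 1), 2 * label + rng.uniform(-1, 1)]
        data.append((x, label))
    return data


data = make_data(200, seed=0)
splits = [data[i:i + 50] for i in range(0, 200, 50)]  # one split per "worker"
models = [train_perceptron(split) for split in splits]
averaged = average(models)
accuracy = sum(predict(averaged, x) == y for x, y in data) / len(data)
```

On data this cleanly separated the averaged hyperplane classifies every point correctly; on harder data, averaging is a heuristic and the result should be validated on held-out examples.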
+
+<a name="PerceptronandWinnow-Roadmap"></a>
+## Roadmap
+
+Currently the patch only contains the code for the classifier itself. It is
+planned to provide unit tests and at least one example based on the WebKB
+dataset by the end of November for the serial version. After that the
+parallelisation will be added.

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/testing.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/misc/testing.md 
b/website/docs/tutorials/map-reduce/misc/testing.md
new file mode 100644
index 0000000..dc3fd43
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/misc/testing.md
@@ -0,0 +1,46 @@
+---
+layout: default
+title: Testing
+theme:
+    name: retro-mahout
+---
+<a name="Testing-Intro"></a>
+# Intro
+
+As Mahout matures, solid testing procedures are needed.  This page and its
+children capture test plans along with ideas for improving our testing.
+
+<a name="Testing-TestPlans"></a>
+# Test Plans
+
+* [0.6](0.6.html)
+ - Test Plans for the 0.6 release
+There are no special plans except for unit tests, and user testing of the
+Hadoop jobs.
+
+<a name="Testing-TestIdeas"></a>
+# Test Ideas
+
+<a name="Testing-Regressions/Benchmarks/Integrations"></a>
+## Regressions/Benchmarks/Integrations
+* Algorithmic quality and speed are not tested, except in a few instances.
+Such tests often require much longer run times (minutes to hours), a
+running Hadoop cluster, and downloads of large datasets (in the megabytes). 
+* Standardized speed tests are difficult on different hardware. 
+* Unit tests of external integrations require access to externals: HDFS,
+S3, JDBC, Cassandra, etc. 
+
+Apache Jenkins is not able to support these environments. Commercial
+donations would help. 
+
+<a name="Testing-UnitTests"></a>
+## Unit Tests
+Mahout's current tests are almost entirely unit tests. Algorithm tests
+generally supply a few numbers to code paths and verify that expected
+numbers come out. 'mvn test' runs these tests. There is "positive" coverage
+of a great many utilities and algorithms. A much smaller percent include
+"negative" coverage (bogus setups, inputs, combinations).
+
+<a name="Testing-Other"></a>
+## Other
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
----------------------------------------------------------------------
diff --git 
a/website/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md 
b/website/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
new file mode 100644
index 0000000..57378ba
--- /dev/null
+++ 
b/website/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
@@ -0,0 +1,222 @@
+---
+layout: default
+title: Using Mahout with Python via JPype
+theme:
+    name: retro-mahout
+---
+
+<a name="UsingMahoutwithPythonviaJPype-overview"></a>
+# Mahout from Python via JPype - some examples
+This tutorial provides some sample code illustrating how we can read and
+write sequence files containing Mahout vectors from Python using JPype.
+This tutorial is intended for people who want to use Python for analyzing
+and plotting Mahout data. Using Mahout from Python turns out to be quite
+easy.
+
+This tutorial concerns the use of CPython as opposed to Jython.
+Jython wasn't an option for me because (to the best of my knowledge)
+Jython doesn't work with the Python extensions numpy, matplotlib, or h5py,
+which I rely on heavily.
+
+The instructions below explain how to set up a Python script to read and
+write the output of Mahout clustering.
+
+You will first need to download and install the JPype package for Python.
+
+The first step in setting up JPype is determining the path to the dynamic
+library for the JVM; on Linux this will be a .so file and on Windows it
+will be a .dll.
+
+In your Python script, create a global variable with the path to this library.
+
+
+
+Next we need to figure out how we need to set the classpath for mahout. The
+easiest way to do this is to edit the script in "bin/mahout" to print out
+the classpath. Add the line "echo $CLASSPATH" to the script somewhere after
+the comment "run it" (this is line 195 or so). Execute the script to print
+out the classpath.  Copy this output and paste it into a variable in your
+python script. The result for me looks like the following
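The pasted classpath is machine-specific. As an alternative to copying it by hand, the two variables can be assembled in the script itself. The sketch below is an assumption and not part of the original tutorial: the `build_classpath` helper and every path in it are hypothetical placeholders, while the variable names `jvmlib` and `classpath` match the ones used by the functions that follow.

```python
import glob
import os


def build_classpath(jar_dirs):
    """Join every .jar found in the given directories into one classpath string."""
    jars = []
    for jar_dir in jar_dirs:
        jars.extend(sorted(glob.glob(os.path.join(jar_dir, "*.jar"))))
    return os.pathsep.join(jars)


# Hypothetical locations; replace them with the paths on your machine.
jvmlib = "/usr/lib/jvm/default-java/lib/server/libjvm.so"
classpath = build_classpath(["/opt/mahout", "/opt/mahout/lib"])
```

If the resulting string is empty, the directories do not contain any jars and the JVM will not find the Mahout or Hadoop classes.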
+
+
+
+
+Now we can create a function to start the JVM in Python using JPype
+
+    from jpype import startJVM
+
+    jvm=None
+    def start_jpype():
+        global jvm
+        if jvm is None:
+            # jvmlib and classpath are the global variables defined above
+            cpopt="-Djava.class.path={cp}".format(cp=classpath)
+            startJVM(jvmlib,"-ea",cpopt)
+            jvm="started"
+
+
+
+<a 
name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a>
+# Writing Named Vectors to Sequence Files from Python
+We can now use JPype to create sequence files which will contain vectors to
+be used by Mahout for kmeans. The example below is a function which creates
+vectors from two Gaussian distributions with unit variance.
+
+
+    import numpy as np
+    from jpype import JPackage, JArray, JDouble, JObject
+
+    def create_inputs(ifile,*args,**param):
+     """Create a sequence file containing some normally distributed points
+       ifile - path to the sequence file to create
+     """
+     
+     #matrix of the cluster means
+     cmeans=np.array([[1,1] ,[-1,-1]],int)
+     
+     nperc=30  #number of points per cluster
+     
+     vecs=[]
+     
+     vnames=[]
+     for cind in range(cmeans.shape[0]):
+      pts=np.random.randn(nperc,2)
+      pts=pts+cmeans[cind,:].reshape([1,cmeans.shape[1]])
+      vecs.append(pts)
+     
+      #names for the vectors
+      #names are just the points with an index
+      #we do this so we can validate by cross-referencing the name with the vector
+      vn=np.empty(nperc,dtype=(str,30))
+      for row in range(nperc):
+       vn[row]="c"+str(cind)+"_"+pts[row,0].astype((str,4))+"_"+pts[row,1].astype((str,4))
+      vnames.append(vn)
+      
+     vecs=np.vstack(vecs)
+     vnames=np.hstack(vnames)
+     
+    
+     #start the jvm
+     start_jpype()
+     
+     #create the sequence file that we will write to
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     DenseVectorCls=JPackage("org").apache.mahout.math.DenseVector
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     writer=io.SequenceFile.createWriter(fs, conf, path,io.Text,VectorWritableCls)
+     
+     
+     vecwritable=VectorWritableCls()
+     for row in range(vecs.shape[0]):
+      nvector=NamedVectorCls(DenseVectorCls(JArray(JDouble,1)(vecs[row,:])),vnames[row])
+      #need to wrap key and value because of overloading
+      wrapkey=JObject(io.Text("key "+str(row)),io.Writable)
+      wrapval=JObject(vecwritable,io.Writable)
+      
+      vecwritable.set(nvector)
+      writer.append(wrapkey,wrapval)
+      
+     writer.close()
+
+
+<a 
name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a>
+# Reading the KMeans Clustered Points from Python
+Similarly we can use JPype to easily read the clustered points outputted by
+mahout.
+
+    def read_clustered_pts(ifile,*args,**param):
+     """Read the clustered points
+     ifile - path to the sequence file containing the clustered points
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file that we will read from
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     
+     
+     ReaderCls=io.__getattribute__("SequenceFile$Reader") 
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=reader.getKeyClass()()
+     
+    
+     valcls=reader.getValueClass()
+     vecwritable=valcls()
+     while (reader.next(key,vecwritable)):     
+      weight=vecwritable.getWeight()
+      nvec=vecwritable.getVector()
+      
+      cname=nvec.__class__.__name__
+      if (cname.rsplit('.',1)[1]=="NamedVector"):  
+       print("cluster={key} Name={name} x={x} y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1)))
+      else:
+       raise NotImplementedError("Vector isn't a NamedVector. Need to modify/test the code to handle this case.")
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a>
+# Reading the KMeans Centroids
+Finally we can create a function to print out the actual cluster centers
+found by mahout,
+
+    def getClusters(ifile,*args,**param):
+     """Read the centroids from the clusters outputted by kmeans
+          ifile - Path to the sequence file containing the centroids
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file that we will read from
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     ReaderCls=io.__getattribute__("SequenceFile$Reader")
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=io.Text()
+     
+    
+     valcls=reader.getValueClass()
+    
+     vecwritable=valcls()
+     
+     while (reader.next(key,vecwritable)):     
+      center=vecwritable.getCenter()
+      
+      print("id={cid} center={center}".format(cid=vecwritable.getId(),center=center.values))
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
----------------------------------------------------------------------
diff --git a/website/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md 
b/website/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
new file mode 100644
index 0000000..2acacd0
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
@@ -0,0 +1,98 @@
+---
+layout: default
+title: Introduction to ALS Recommendations with Hadoop
+theme:
+    name: retro-mahout
+---
+
+# Introduction to ALS Recommendations with Hadoop
+
+## Overview
+
+Mahout's ALS recommender is a matrix factorization algorithm that uses
+Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It
+factors the user-to-item matrix *A* into the user-to-feature matrix *U* and the
+item-to-feature matrix *M*, and it runs the ALS algorithm in a parallel fashion.
+Details of the algorithm can be found in the following papers:
+
+* [Large-scale Parallel Collaborative Filtering for the Netflix Prize](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
+* [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433)
+
+This recommendation algorithm can be used in an eCommerce platform to recommend
+products to customers. Unlike the user or item based recommenders, which compute
+the similarity of users or items to make recommendations, the ALS algorithm
+uncovers the latent factors that explain the observed user-to-item ratings and
+tries to find optimal factor weights to minimize the squared error between
+predicted and actual ratings.
+
+Mahout's ALS recommendation algorithm takes as input user preferences by item
+and generates as output recommended items for a user. The input customer
+preference can either be explicit user ratings or implicit feedback such as
+a user's click on a web page.
+
+One of the strengths of the ALS based recommender, compared to the user or
+item based recommender, is its ability to handle large sparse data sets and its
+better prediction performance. It can also give an intuitive rationale for
+the factors that influence recommendations.
+
+## Implementation
+At present Mahout has a map-reduce implementation of ALS, which is composed of
+2 jobs: a parallel matrix factorization job and a recommendation job.
+The matrix factorization job computes the user-to-feature matrix and
+item-to-feature matrix given the user-to-item ratings. Its input includes:
+<pre>
+    --input: directory containing files of explicit user to item rating or implicit feedback;
+    --output: output path of the user-feature matrix and feature-item matrix;
+    --lambda: regularization parameter to avoid overfitting;
+    --alpha: confidence parameter only used on implicit feedback;
+    --implicitFeedback: boolean flag to indicate whether the input dataset contains implicit feedback;
+    --numFeatures: dimensions of feature space;
+    --numThreadsPerSolver: number of threads per solver mapper for concurrent execution;
+    --numIterations: number of iterations;
+    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated
+</pre>
+and it outputs the matrices in sequence file format.
+
+The recommendation job uses the user feature matrix and item feature matrix
+calculated by the factorization job to compute the top-N recommendations per
+user. Its input includes:
+<pre>
+    --input: directory containing files of user ids;
+    --output: output path of the recommended items for each input user id;
+    --userFeatures: path to the user feature matrix;
+    --itemFeatures: path to the item feature matrix;
+    --numRecommendations: maximum number of recommendations per user, default is 10;
+    --maxRating: maximum rating available;
+    --numThreads: number of threads per mapper;
+    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated;
+    --userIDIndex: index for user long IDs (necessary if usesLongIDs is true);
+    --itemIDIndex: index for item long IDs (necessary if usesLongIDs is true)
+</pre>
+and it outputs a list of recommended item ids for each user. The predicted
+rating between user and item is the dot product of the user's feature vector and
+the item's feature vector.
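The prediction step can be illustrated with a few lines of Python. This is a sketch of the dot-product scoring only, not Mahout's recommendation job; the feature vectors, item ids and helper names below are made up for this example.

```python
def predict_rating(user_vec, item_vec):
    """Predicted rating: dot product of the user and item feature vectors."""
    return sum(u * m for u, m in zip(user_vec, item_vec))


def top_n(user_vec, item_features, n, already_rated=()):
    """Return the ids of the n best-scoring items the user has not rated yet."""
    scored = [
        (predict_rating(user_vec, vec), item)
        for item, vec in item_features.items()
        if item not in already_rated
    ]
    # Highest score first; ties are broken by the smaller item id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:n]]


# Hypothetical 2-dimensional feature vectors.
user = [0.5, 1.0]
items = {100: [1.0, 0.0], 200: [0.0, 2.0], 300: [1.0, 1.0]}

best_two = top_n(user, items, n=2)
unseen_only = top_n(user, items, n=2, already_rated={200})
```

Excluding the items a user has already rated mirrors the way the recommendation job removes them from its output.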
+
+## Example
+
+Let's look at a simple example of how we could use Mahout's ALS
+recommender to recommend items for users. First, you'll need to get Mahout up
+and running, the instructions for which can be found
+[here](https://mahout.apache.org/users/basics/quickstart.html). After you've
+ensured Mahout is properly installed, we're ready to run the example.
+
+**Step 1: Prepare test data**
+
+Similar to Mahout's item based recommender, the ALS recommender relies on
+user-to-item preference data: *userID*, *itemID* and *preference*. The
+preference can be an explicit numeric rating or a count of actions such as
+clicks (implicit feedback). Each line of the test data file is a
+tab-delimited string: the 1st field is the user id, which must be numeric, the
+2nd field is the item id, which must be numeric, and the 3rd field is the
+preference, which should also be a number.
+
+**Note:** You must create IDs that are ordinal positive integers for all user
+and item IDs. Often this will require you to keep a dictionary
+to map into and out of Mahout IDs. For instance, if the first user has ID "xyz"
+in your application, this would get a Mahout ID of the integer 1 and so on.
+The same applies to item IDs. Then, after recommendations are calculated, you
+will have to translate the Mahout user and item IDs back into your application IDs.
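Such a dictionary can be kept with a small helper like the one sketched below. It is an illustration only: the `IdDictionary` class, the application ids and the events are hypothetical, and a real pipeline would also need to persist the mapping between runs.

```python
class IdDictionary:
    """Maps application ids to consecutive positive integers and back."""

    def __init__(self):
        self._to_mahout = {}
        self._to_app = []

    def to_mahout(self, app_id):
        if app_id not in self._to_mahout:
            self._to_app.append(app_id)
            self._to_mahout[app_id] = len(self._to_app)  # ids start at 1
        return self._to_mahout[app_id]

    def to_app(self, mahout_id):
        return self._to_app[mahout_id - 1]


users = IdDictionary()
items = IdDictionary()

# Hypothetical (user, item, preference) events from an application.
events = [("xyz", "sku-9", 5), ("abc", "sku-9", 2), ("xyz", "sku-4", 1)]

# Tab-delimited lines in the userID, itemID, preference layout described above.
lines = [
    "{}\t{}\t{}".format(users.to_mahout(u), items.to_mahout(i), p)
    for u, i, p in events
]
```

After the recommendation job has run, `users.to_app` and `items.to_app` translate the integer ids in its output back into application ids.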
+
+To get started quickly, you could specify a text file like the following as the input:
+<pre>
+1      100     1
+1      200     5
+1      400     1
+2      200     2
+2      300     1
+</pre>
+
+**Step 2: Determine parameters**
+
+In addition, users need to determine the dimension of the feature space and the
+number of iterations to run the alternating least squares algorithm. Using 10
+features and 15 iterations is a reasonable default to try first. Optionally, a
+confidence parameter can be set if the input preference is implicit user feedback.
+
+**Step 3: Run ALS**
+
+Assuming your *JAVA_HOME* is appropriately set and Mahout was installed 
properly, we're ready to run ALS. Enter the following command:
+
+    $ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 
--implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5  
--numThreadsPerSolver 1 --tempDir tmp 
+
+Running the command will execute a series of jobs, the final product of which 
is written to the output directory specified in the 
command. The output directory contains 3 sub-directories: *M* stores the 
item-to-feature matrix, *U* stores the user-to-feature matrix and *userRatings* 
stores the users' ratings of the items. The *tempDir* parameter specifies the 
directory in which to store the intermediate output of the job, such as the matrix 
output in each iteration and each item's average rating. Keeping the *tempDir* 
contents helps with debugging.
+
+**Step 4: Make Recommendations**
+
+Based on the output feature matrices from step 3, we could make 
recommendations for users. Enter the following command:
+
+     $ mahout recommendfactorized --input $als_recommender_input 
--userFeatures $als_output/U/ --itemFeatures $als_output/M/ 
--numRecommendations 1 --output recommendations --maxRating 1
+
+The input user file is a sequence file: the record key is the user id and the 
value is the ids of the items the user has already rated, which will be excluded from the recommendations. 
The output generated in our simple example will be a text file giving the 
recommended item ids for each user. 
+Remember to translate the Mahout ids back into your application-specific ids. 
+
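Conceptually, this step scores every item the user has not yet rated and keeps the top few. The sketch below illustrates that idea in plain Python; it is not Mahout's implementation, and the function name and feature values are hypothetical.

```python
def recommend(user_vec, item_vecs, already_rated, num_recommendations=1):
    """Rank unrated items by the dot product of user and item features."""
    scores = {
        item_id: sum(u * m for u, m in zip(user_vec, vec))
        for item_id, vec in item_vecs.items()
        if item_id not in already_rated  # rated items are never recommended
    }
    # Highest score first; ties are broken by item id for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:num_recommendations]


# Hypothetical item-feature rows keyed by item id, and one user-feature row.
item_vecs = {100: [1.0, 0.0], 200: [0.0, 1.0], 300: [0.5, 0.5], 400: [0.9, 0.1]}
user_vec = [0.2, 0.8]

# A user who has already rated items 200 and 300.
print(recommend(user_vec, item_vecs, already_rated={200, 300}))
```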
+There exist a variety of parameters for Mahout's ALS recommender to 
accommodate custom business requirements; exploring and testing various 
configurations to suit your needs will doubtless lead to additional questions. 
Feel free to ask such questions on the [mailing 
list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html).
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/516e3fb9/website/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md
----------------------------------------------------------------------
diff --git 
a/website/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md 
b/website/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md
new file mode 100644
index 0000000..578f3c4
--- /dev/null
+++ b/website/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md
@@ -0,0 +1,437 @@
+---
+layout: default
+title: Intro to Cooccurrence Recommenders with Spark
+theme:
+    name: retro-mahout
+---
+
+#Intro to Cooccurrence Recommenders with Spark
+
+Mahout provides several important building blocks for creating recommendations 
using Spark. *spark-itemsimilarity* can 
+be used to create "other people also liked these things" type recommendations 
and paired with a search engine can 
+personalize recommendations for individual users. *spark-rowsimilarity* can 
provide non-personalized content based 
+recommendations and when paired with a search engine can be used to 
personalize content based recommendations.
+
+##References
+
+1. A free ebook, which talks about the general idea: [Practical Machine 
Learning](https://www.mapr.com/practical-machine-learning)
+2. A slide deck, which talks about mixing actions or other indicators: 
[Creating a Unified 
Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+3. Two blog posts: [What's New in Recommenders: part 
#1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
+and  [What's New in Recommenders: part 
#2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
+4. A post describing the log-likelihood ratio:  [Surprise and 
Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
  LLR is used to reduce noise in the data while keeping the complexity of the 
calculations at O(n).
+
+Below are the command line jobs but the drivers and associated code can also 
be customized and accessed from the Scala APIs.
+
+##1. spark-itemsimilarity
+*spark-itemsimilarity* is the Spark counterpart of the Mahout mapreduce 
job called *itemsimilarity*. It takes in elements of interactions, which have a 
userID, an itemID, and optionally a value. It will produce one or more indicator 
matrices created by comparing every user's interactions with every other user's. 
The indicator matrix is an item x item matrix where the values are 
log-likelihood ratio strengths. For the legacy mapreduce version, there were 
several possible similarity measures, but these are being deprecated in favor of 
LLR because in practice it performs the best.
+
+Mahout's mapreduce version of itemsimilarity takes a text file that is 
expected to have user and item IDs that conform to 
+Mahout's ID requirements--they are non-negative integers that can be viewed as 
row and column numbers in a matrix.
+
+*spark-itemsimilarity* also extends the notion of cooccurrence to 
cross-cooccurrence, in other words the Spark version will 
+account for multi-modal interactions and create indicator matrices allowing 
the use of much more data in 
+creating recommendations or similar item lists. People try to do this by 
mixing different actions and giving them weights. 
+For instance they might say an item-view is 0.2 of an item purchase. In 
practice this is often not helpful. Spark-itemsimilarity's
+cross-cooccurrence is a more principled way to handle this case. In effect it 
scrubs secondary actions with the action you want
+to recommend.   
+
+
+    spark-itemsimilarity Mahout 1.0
+    Usage: spark-itemsimilarity [options]
+    
+    Input, output options
+      -i <value> | --input <value>
+            Input path, may be a filename, directory name, or comma delimited 
list of HDFS supported URIs (required)
+      -i2 <value> | --input2 <value>
+            Secondary input path for cross-similarity calculation, same 
restrictions as "--input" (optional). Default: empty.
+      -o <value> | --output <value>
+            Path for output, any local or HDFS supported URI (required)
+    
+    Algorithm control options:
+      -mppu <value> | --maxPrefs <value>
+            Max number of preferences to consider per user (optional). 
Default: 500
+      -m <value> | --maxSimilaritiesPerItem <value>
+            Limit the number of similarities per item to this number 
(optional). Default: 100
+    
+    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity 
measure.
+    
+    Input text file schema options:
+      -id <value> | --inDelim <value>
+            Input delimiter character (optional). Default: "[,\t]"
+      -f1 <value> | --filter1 <value>
+            String (or regex) whose presence indicates a datum for the primary 
item set (optional). Default: no filter, all data is used
+      -f2 <value> | --filter2 <value>
+            String (or regex) whose presence indicates a datum for the 
secondary item set (optional). If not present no secondary dataset is collected
+      -rc <value> | --rowIDColumn <value>
+            Column number (0 based Int) containing the row ID string 
(optional). Default: 0
+      -ic <value> | --itemIDColumn <value>
+            Column number (0 based Int) containing the item ID string 
(optional). Default: 1
+      -fc <value> | --filterColumn <value>
+            Column number (0 based Int) containing the filter string 
(optional). Default: -1 for no filter
+    
+    Using all defaults the input is expected of the form: "userID<tab>itemId" 
or "userID<tab>itemID<tab>any-text..." and all rows will be used
+    
+    File discovery options:
+      -r | --recursive
+            Searched the -i path recursively for files that match 
--filenamePattern (optional), Default: false
+      -fp <value> | --filenamePattern <value>
+            Regex to match in determining input files (optional). Default: 
filename in the --input option or "^part-.*" if --input is a directory
+    
+    Output text file schema options:
+      -rd <value> | --rowKeyDelim <value>
+            Separates the rowID key from the vector values list (optional). 
Default: "\t"
+      -cd <value> | --columnIdStrengthDelim <value>
+            Separates column IDs from their values in the vector values list 
(optional). Default: ":"
+      -td <value> | --elementDelim <value>
+            Separates vector element values in the values list (optional). 
Default: " "
+      -os | --omitStrength
+            Do not write the strength to the output files (optional), Default: 
false.
+    This option is used to output indexable data for creating a search engine 
recommender.
+    
+    Default delimiters will produce output of the form: 
"itemID1<tab>itemID2:value2<space>itemID10:value10..."
+    
+    Spark config options:
+      -ma <value> | --master <value>
+            Spark Master URL (optional). Default: "local". Note that you can 
specify the number of cores to get a performance improvement, for example 
"local[4]"
+      -sem <value> | --sparkExecutorMem <value>
+            Max Java heap available as "executor memory" on each node 
(optional). Default: 4g
+      -rs <value> | --randomSeed <value>
+            
+      -h | --help
+            prints this usage text
+
+This looks daunting, but the defaults are simple, fairly sane values that take exactly 
the same input as the legacy code, and the job is pretty flexible. It allows the user to 
point to a single text file, a directory full of files, or a tree of 
directories to be traversed recursively. The files included can be specified 
with either a regex-style pattern or filename. The schema for the file is 
defined by column numbers, which map to the important bits of data including 
IDs and values. The files can even contain filters, which allow unneeded rows 
to be discarded or used for cross-cooccurrence calculations.
+
+See ItemSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. 
+
+###Defaults in the _**spark-itemsimilarity**_ CLI
+
+If all defaults are used the input can be as simple as:
+
+    userID1,itemID1
+    userID2,itemID2
+    ...
+
+With the command line:
+
+
+    bash$ mahout spark-itemsimilarity --input in-file --output out-dir
+
+
+This will use the "local" Spark context and will output the standard text 
version of a DRM (distributed row matrix):
+
+    itemID1<tab>itemID2:value2<space>itemID10:value10...
+
+###<a name="multiple-actions">How To Use Multiple User Actions</a>
+
+Often we record various actions the user takes for later analytics. These can 
now be used to make recommendations. 
+The idea of a recommender is to recommend the action you want the user to 
make. For an ecom app this might be 
+a purchase action. It is usually not a good idea to just treat other actions 
the same as the action you want to recommend. 
+For instance a view of an item does not indicate the same intent as a purchase 
and if you just mixed the two together you 
+might even make worse recommendations. It is tempting though since there are 
so many more views than purchases. With *spark-itemsimilarity*
+we can now use both actions. Mahout will use cross-action cooccurrence 
analysis to limit the views to ones that do predict purchases.
+We do this by treating the primary action (purchase) as data for the indicator 
matrix and use the secondary action (view) 
+to calculate the cross-cooccurrence indicator matrix.  
+
+*spark-itemsimilarity* can read separate actions from separate files or from a 
mixed action log by filtering certain lines. For a mixed 
+action log of the form:
+
+    u1,purchase,iphone
+    u1,purchase,ipad
+    u2,purchase,nexus
+    u2,purchase,galaxy
+    u3,purchase,surface
+    u4,purchase,iphone
+    u4,purchase,galaxy
+    u1,view,iphone
+    u1,view,ipad
+    u1,view,nexus
+    u1,view,galaxy
+    u2,view,iphone
+    u2,view,ipad
+    u2,view,nexus
+    u2,view,galaxy
+    u3,view,surface
+    u3,view,nexus
+    u4,view,iphone
+    u4,view,ipad
+    u4,view,galaxy
+
+###Command Line
+
+
+Use the following options:
+
+    # --input: where to look for data; --output: root dir for output
+    # --master: URL of the Spark master server
+    # --filter1 / --filter2: words that flag input for the primary / secondary action
+    # --itemIDColumn, --rowIDColumn, --filterColumn: columns with the item ID, user ID and filter word
+    bash$ mahout spark-itemsimilarity \
+        --input in-file \
+        --output out-path \
+        --master masterUrl \
+        --filter1 purchase \
+        --filter2 view \
+        --itemIDColumn 2 \
+        --rowIDColumn 0 \
+        --filterColumn 1
+
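To make the roles of the filter and column options concrete, here is a rough Python sketch of how a mixed log can be split into a primary and a secondary set of (user, item) pairs. It only mirrors the option semantics described in the usage text; `split_actions` and its parameters are hypothetical names, not part of Mahout.

```python
import re


def split_actions(lines, filter1, filter2, row_col, item_col, filter_col,
                  delim=r"[,\t]"):
    """Split a mixed action log into primary and secondary (user, item) pairs."""
    primary, secondary = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = re.split(delim, line)
        pair = (fields[row_col], fields[item_col])
        # The filter string's presence in the filter column selects the set.
        if re.search(filter1, fields[filter_col]):
            primary.append(pair)
        elif re.search(filter2, fields[filter_col]):
            secondary.append(pair)
    return primary, secondary


log = [
    "u1,purchase,iphone",
    "u1,view,iphone",
    "u2,purchase,nexus",
    "u2,view,ipad",
]
purchases, views = split_actions(log, "purchase", "view",
                                 row_col=0, item_col=2, filter_col=1)
print(purchases)
print(views)
```

The same function handles a tab-delimited log with extra columns by changing only the column numbers, which is the situation the column options are meant for.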
+
+
+###Output
+
+The output of the job will be the standard text version of two Mahout DRMs. 
Because we are calculating 
+cross-cooccurrence, both a primary indicator matrix and a cross-cooccurrence 
indicator matrix will be created:
+
+    out-path
+      |-- similarity-matrix - TDF part files
+      \-- cross-similarity-matrix - TDF part-files
+
+The indicator matrix will contain the lines:
+
+    galaxy\tnexus:1.7260924347106847
+    ipad\tiphone:1.7260924347106847
+    nexus\tgalaxy:1.7260924347106847
+    iphone\tipad:1.7260924347106847
+    surface
+
+The cross-cooccurrence indicator matrix will contain:
+
+    iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
+    ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
+    nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 
ipad:0.6795961471815897 galaxy:0.6795961471815897
+    galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 
ipad:1.7260924347106847 galaxy:1.7260924347106847
+    surface\tsurface:4.498681156950466 nexus:0.6795961471815897
+
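The strengths in these matrices are log-likelihood ratios computed from a 2x2 contingency table of user counts. The sketch below uses the entropy form of LLR from the "Surprise and Coincidence" post listed in the references and applies it to the example log. It is an illustration only, not Mahout's code: the helper names are hypothetical, and it ignores the downsampling and per-item limits the real job applies.

```python
from math import log


def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def indicator(primary, secondary, item_a, item_b):
    """LLR between item_a in the primary action and item_b in the secondary."""
    users = set(primary) | set(secondary)
    a = {u for u in users if item_a in primary.get(u, ())}
    b = {u for u in users if item_b in secondary.get(u, ())}
    return llr(len(a & b), len(a - b), len(b - a), len(users) - len(a | b))


# The purchase and view actions from the mixed action log above.
purchases = {"u1": {"iphone", "ipad"}, "u2": {"nexus", "galaxy"},
             "u3": {"surface"}, "u4": {"iphone", "galaxy"}}
views = {"u1": {"iphone", "ipad", "nexus", "galaxy"},
         "u2": {"iphone", "ipad", "nexus", "galaxy"},
         "u3": {"surface", "nexus"},
         "u4": {"iphone", "ipad", "galaxy"}}

print(indicator(purchases, purchases, "galaxy", "nexus"))  # cooccurrence
print(indicator(purchases, views, "surface", "surface"))   # cross-cooccurrence
```

For this data the two calls evaluate to roughly 1.726 and 4.499, in line with the `galaxy`/`nexus` and `surface`/`surface` entries listed above.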
+**Note:** You can run this multiple times to use more than two actions or you 
can use the underlying 
+SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any 
number of cross-cooccurrence indicators.
+
+###Log File Input
+
+A common method of storing data is in log files. If they are written using 
some delimiter they can be consumed directly by spark-itemsimilarity. For 
instance input of the form:
+
+    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
+    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
+    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
+    2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy    
+
+can be parsed with the following CLI and run on the cluster, producing the same 
output as the above example:
+
+    bash$ mahout spark-itemsimilarity \
+        --input in-file \
+        --output out-path \
+        --master spark://sparkmaster:4044 \
+        --filter1 purchase \
+        --filter2 view \
+        --inDelim "\t" \
+        --itemIDColumn 4 \
+        --rowIDColumn 1 \
+        --filterColumn 2
+
+##2. spark-rowsimilarity
+
+*spark-rowsimilarity* is the companion to *spark-itemsimilarity*. The primary 
difference is that it takes a text file version of 
+a matrix of sparse vectors with optional application specific IDs and it finds 
similar rows rather than items (columns). Its use is
+not limited to collaborative filtering. The input is in text-delimited form 
where there are three delimiters used. By 
+default it reads 
(rowID&lt;tab>columnID1:strength1&lt;space>columnID2:strength2...) Since this 
job only supports LLR similarity,
+ which does not use the input strengths, they may be omitted in the input. It 
writes 
+(rowID&lt;tab>rowID1:strength1&lt;space>rowID2:strength2...) 
+The output is sorted by strength descending. The output can be interpreted as 
a row ID from the primary input followed 
+by a list of the most similar rows.
+
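The three delimiters can be seen at work in a small parser for one line of this text format. This is a sketch under the default delimiters (tab, colon, space); `parse_row` is a hypothetical helper, and strengths are returned as `None` when they are omitted.

```python
def parse_row(line, row_delim="\t", strength_delim=":", element_delim=" "):
    """Parse 'rowID<tab>colID1:strength1<space>colID2:strength2...'."""
    row_id, _, rest = line.rstrip("\n").partition(row_delim)
    elements = []
    for token in rest.split(element_delim):
        if not token:
            continue  # skip empty tokens, e.g. a row with no elements
        col_id, sep, strength = token.partition(strength_delim)
        # A missing strength delimiter means the strength was omitted.
        elements.append((col_id, float(strength) if sep else None))
    return row_id, elements


print(parse_row("galaxy\tnexus:1.7260924347106847"))
print(parse_row("3459860b\tmen long-sleeve chambray"))
```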
+The command line interface is:
+
+    spark-rowsimilarity Mahout 1.0
+    Usage: spark-rowsimilarity [options]
+    
+    Input, output options
+      -i <value> | --input <value>
+            Input path, may be a filename, directory name, or comma delimited 
list of HDFS supported URIs (required)
+      -o <value> | --output <value>
+            Path for output, any local or HDFS supported URI (required)
+    
+    Algorithm control options:
+      -mo <value> | --maxObservations <value>
+            Max number of observations to consider per row (optional). 
Default: 500
+      -m <value> | --maxSimilaritiesPerRow <value>
+            Limit the number of similarities per item to this number 
(optional). Default: 100
+    
+    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity 
measure.
+    
+    Output text file schema options:
+      -rd <value> | --rowKeyDelim <value>
+            Separates the rowID key from the vector values list (optional). 
Default: "\t"
+      -cd <value> | --columnIdStrengthDelim <value>
+            Separates column IDs from their values in the vector values list 
(optional). Default: ":"
+      -td <value> | --elementDelim <value>
+            Separates vector element values in the values list (optional). 
Default: " "
+      -os | --omitStrength
+            Do not write the strength to the output files (optional), Default: 
false.
+    This option is used to output indexable data for creating a search engine 
recommender.
+    
+    Default delimiters will produce output of the form: 
"itemID1<tab>itemID2:value2<space>itemID10:value10..."
+    
+    File discovery options:
+      -r | --recursive
+            Searched the -i path recursively for files that match 
--filenamePattern (optional), Default: false
+      -fp <value> | --filenamePattern <value>
+            Regex to match in determining input files (optional). Default: 
filename in the --input option or "^part-.*" if --input is a directory
+    
+    Spark config options:
+      -ma <value> | --master <value>
+            Spark Master URL (optional). Default: "local". Note that you can 
specify the number of cores to get a performance improvement, for example 
"local[4]"
+      -sem <value> | --sparkExecutorMem <value>
+            Max Java heap available as "executor memory" on each node 
(optional). Default: 4g
+      -rs <value> | --randomSeed <value>
+            
+      -h | --help
+            prints this usage text
+
+See RowSimilarityDriver.scala in Mahout's spark module if you want to 
customize the code. 
+
+#3. Using *spark-rowsimilarity* with Text Data
+
+Another use case for *spark-rowsimilarity* is in finding similar textual 
content. For instance, given the tags associated with 
+a blog post, it can find
+ which other posts have similar tags. In this case the columns are tags and 
the rows are posts. Since LLR is 
+the only similarity method supported this is not the optimal way to determine 
general "bag-of-words" document similarity. 
+LLR is used more as a quality filter than as a similarity measure. However 
*spark-rowsimilarity* will produce 
+lists of similar docs for every doc if input is docs with lists of terms. The 
Apache [Lucene](http://lucene.apache.org) project provides several methods of 
[analyzing and 
tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description)
 documents.
+
+#<a name="unified-recommender">4. Creating a Unified Recommender</a>
+
+Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can 
build a unified cooccurrence and content based
+ recommender that can be used in both or either mode depending on indicators 
available and the history available at 
+runtime for a user.
+
+##Requirements
+
+1. Mahout 0.10.0 or later
+2. Hadoop
+3. Spark, the correct version for your version of Mahout and Hadoop
+4. A search engine like Solr or Elasticsearch
+
+##Indicators
+
+Indicators come in 3 types:
+
+1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions
+2. **Content**: calculated from item metadata or content using 
*spark-rowsimilarity*
+3. **Intrinsic**: assigned to items as metadata. Can be anything that 
describes the item.
+
+The query for recommendations will be a mix of values meant to match one of 
your indicators. The query can be constructed 
+from user history and values derived from context (category being viewed for 
instance) or special precalculated data 
+(popularity rank for instance). This blending of indicators allows for 
creating many flavors of recommendations to fit 
+a very wide variety of circumstances.
+
+With the right mix of indicators developers can construct a single query that 
works for completely new items and new users 
+while working well for items with lots of interactions and users with many 
recorded actions. In other words by adding in content and intrinsic 
+indicators developers can create a solution for the "cold-start" problem that 
gracefully improves with more user history
+and as items have more interactions. It is also possible to create a 
completely content-based recommender that personalizes 
+recommendations.
+
+##Example with 3 Indicators
+
+You will need to decide how you store user action data so they can be 
processed by the item and row similarity jobs and 
+this is most easily done by using text files as described above. The data that 
is processed by these jobs is considered the 
+training data. You will need some amount of user history in your recs query. 
It is typical to use the most recent user history, 
+but it need not be exactly what is in the training set, which may include a 
greater volume of historical data. Keeping the user 
+history for query purposes could be done with a database by storing it in a 
users table. In the example above the two 
+collaborative filtering actions are "purchase" and "view", but let's also add 
tags (taken from catalog categories or other 
+descriptive metadata). 
+
+We will need to create 1 cooccurrence indicator from the primary action 
(purchase), 1 cross-action cooccurrence indicator 
+from the secondary action (view), 
+and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once 
and *spark-rowsimilarity* once.
+
+We have described how to create the collaborative filtering indicator and 
cross-cooccurrence indicator for purchase and view (the [How to use Multiple 
User 
+Actions](#multiple-actions) section) but tags will be a slightly different 
process. We want to use the fact that 
+certain items have tags similar to the ones associated with a user's 
purchases. This is not a collaborative filtering indicator 
+but rather a "content" or "metadata" type indicator since you are not using 
other users' history, only the 
+individual that you are making recs for. This means that this method will make 
recommendations for items that have 
+no collaborative filtering data, as happens with new items in a catalog. New 
items may have tags assigned but no one
+ has purchased or viewed them yet. In the final query we will mix all 3 
indicators.
+
+##Content Indicator
+
+To create a content-indicator we'll make use of the fact that the user has 
purchased items with certain tags. We want to find 
+items with the most similar tags. Notice that other users' behavior is not 
considered--only other items' tags. This defines a 
+content or metadata indicator. They are used when you want to find items that 
are similar to other items by using their 
+content or metadata, not by which users interacted with them.
+
+For this we need input of the form:
+
+    itemID<tab>list-of-tags
+    ...
+
+The full collection will look like the tags column from a catalog DB. For our 
ecom example it might be:
+
+    3459860b<tab>men long-sleeve chambray clothing casual
+    9446577d<tab>women tops chambray clothing casual
+    ...
+
+We'll use *spark-rowsimilarity* because we are looking for similar rows, which 
encode items in this case. As with the 
+collaborative filtering indicator and cross-cooccurrence indicator we use the 
--omitStrength option. The strengths created are 
+probabilistic log-likelihood ratios and so are used to filter unimportant 
similarities. Once the filtering or downsampling 
+is finished we no longer need the strengths. We will get an indicator matrix 
of the form:
+
+    itemID<tab>list-of-item IDs
+    ...
+
+This is a content indicator since it has found other items with similar 
content or metadata.
+
+    3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a
+    9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e
+    ...  
+    
+We now have three indicators, two collaborative filtering type and one content 
type.
+
+##Unified Recommender Query
+
+The actual form of the query for recommendations will vary depending on your 
search engine but the intent is the same. 
+For a given user, map their history of an action or content to the correct 
indicator field and perform an OR'd query. 
+
+We have 3 indicators, which are indexed by the search engine into 3 fields; 
we'll call them "purchase", "view", and "tags". 
+We take the user's history that corresponds to each indicator and create a 
query of the form:
+
+    Query:
+      field: purchase; q:user's-purchase-history
+      field: view; q:user's-view-history
+      field: tags; q:user's-tags-associated-with-purchases
+      
+The query will result in an ordered list of items recommended for purchase but 
skewed towards items with similar tags to 
+the ones the user has already purchased. 
+
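As one possible rendering of that query, the sketch below builds an Elasticsearch-style `bool`/`should` request body, which ORs one `match` clause per indicator field. It assumes the indicators were indexed into fields named "purchase", "view" and "tags"; the helper name, the history values and the optional boosts are hypothetical, and other search engines will need a different syntax.

```python
import json


def build_query(history, boosts=None):
    """Build an OR'd query with one match clause per indicator field."""
    boosts = boosts or {}
    should = []
    for field, values in history.items():
        if not values:
            continue  # no history for this indicator, so leave the field out
        should.append({
            "match": {
                field: {
                    "query": " ".join(values),
                    "boost": boosts.get(field, 1.0),
                }
            }
        })
    return {"query": {"bool": {"should": should}}}


# Hypothetical user history, keyed by the indicator field it is matched against.
history = {
    "purchase": ["iphone", "ipad"],
    "view": ["nexus", "galaxy"],
    "tags": ["phone", "tablet"],
}
query = build_query(history, boosts={"purchase": 2.0})
print(json.dumps(query, indent=2))
```

Leaving out fields with no history lets the same function serve new users, for whom only context or intrinsic values may be available.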
+This is only an example and not necessarily the optimal way to create recs. It 
illustrates how business decisions can be 
+translated into recommendations. This technique can be used to skew 
recommendations towards intrinsic indicators also. 
+For instance you may want to put personalized popular item recs in a special 
place in the UI. Create a popularity indicator 
+by tagging items with some category of popularity (hot, warm, cold for 
instance) then
+index that as a new indicator field and include the corresponding value in a 
query 
+on the popularity field. If we use the ecom example but use the query to get 
"hot" recommendations it might look like this:
+
+    Query:
+      field: purchase; q:user's-purchase-history
+      field: view; q:user's-view-history
+      field: popularity; q:"hot"
+
+This will return recommendations favoring ones that have the intrinsic 
indicator "hot".
+
+##Notes
+1. Use as much user action history as you can gather. Choose a primary action 
that is closest to what you want to recommend and the others will be used to 
create cross-cooccurrence indicators. Using more data in this fashion will 
almost always produce better recommendations.
+2. Content indicators can be used where there is no recorded user behavior or when items 
change too quickly to get much interaction history. They can be used alone or 
mixed with other indicators.
+3. Most search engines support "boost" factors so you can favor one or more 
indicators. In the example query, if you want tags to only have a small effect 
you could boost the CF indicators.
+4. In the examples we have used space delimited strings for lists of IDs in 
indicators and in queries. It may be better to use arrays of strings if your 
storage system and search engine support them. For instance Solr allows 
multi-valued fields, which correspond to arrays.
