[5/9] mahout git commit: WEBSITE Triage of Old Site Migration

rawkintrevo Sat, 29 Apr 2017 16:25:05 -0700

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/fuzzy-k-means-commandline.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/fuzzy-k-means-commandline.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/fuzzy-k-means-commandline.md
new file mode 100644
index 0000000..1374682
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/fuzzy-k-means-commandline.md
@@ -0,0 +1,97 @@
+---
+layout: default
+title: fuzzy-k-means-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="fuzzy-k-means-commandline-RunningFuzzyk-MeansClusteringfromtheCommandLine"></a>
+# Running Fuzzy k-Means Clustering from the Command Line
+Mahout's Fuzzy k-Means clustering can be launched from the same command
+line invocation whether you are running on a single machine in stand-alone
+mode or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine, then the invocation will
+run Fuzzy k-Means on that cluster. If either of the environment variables is
+missing, then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout fkmeans <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job.
+
+
+<a name="fuzzy-k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on a single machine without a cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+
+<a name="fuzzy-k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="fuzzy-k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                      Path to job input directory.
+                                              Must be a SequenceFile of
+                                              VectorWritable
+      --clusters (-c) clusters                The input centroids, as Vectors.
+                                              Must be a SequenceFile of
+                                              Writable, Cluster/Canopy. If k
+                                              is also specified, then a random
+                                              set of vectors will be selected
+                                              and written out to this path
+                                              first
+      --output (-o) output                    The directory pathname for
+                                              output.
+      --distanceMeasure (-dm) distanceMeasure The classname of the
+                                              DistanceMeasure. Default is
+                                              SquaredEuclidean
+      --convergenceDelta (-cd) convergenceDelta
+                                              The convergence delta value.
+                                              Default is 0.5
+      --maxIter (-x) maxIter                  The maximum number of
+                                              iterations.
+      --k (-k) k                              The k in k-Means.  If specified,
+                                              then a random selection of k
+                                              Vectors will be chosen as the
+                                              Centroid and written to the
+                                              clusters input path.
+      --m (-m) m                              coefficient normalization
+                                              factor, must be greater than 1
+      --overwrite (-ow)                       If present, overwrite the output
+                                              directory before running job
+      --help (-h)                             Print out help
+      --numMap (-u) numMap                    The number of map tasks.
+                                              Defaults to 10
+      --maxRed (-r) maxRed                    The number of reduce tasks.
+                                              Defaults to 2
+      --emitMostLikely (-e) emitMostLikely    True if clustering should emit
+                                              the most likely point only,
+                                              false for threshold clustering.
+                                              Default is true
+      --threshold (-t) threshold              The pdf threshold used for
+                                              cluster determination. Default
+                                              is 0
+      --clustering (-cl)                      If present, run clustering after
+                                              the iterations have taken place
+
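
The mode-selection rule stated in fuzzy-k-means-commandline.md above (cluster mode only when both $HADOOP_HOME and $HADOOP_CONF_DIR are set) can be sketched in a few lines of Java. This is a minimal illustration of the rule as the page words it, not the logic of the bin/mahout script itself. The class and method names are hypothetical, and treating an empty value as "missing" is an assumption.

```java
import java.util.Map;

/** Illustrative sketch only; MahoutModeSketch is a hypothetical name, not part of Mahout. */
final class MahoutModeSketch {

    /** Returns true when both variables are present and non-empty, as the page describes. */
    static boolean runsOnCluster(Map<String, String> env) {
        return isSet(env.get("HADOOP_HOME")) && isSet(env.get("HADOOP_CONF_DIR"));
    }

    /** Assumption: a variable that is defined but blank counts as missing. */
    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Inspect the real environment of the current process.
        String mode = runsOnCluster(System.getenv()) ? "Hadoop cluster" : "stand-alone";
        System.out.println("fkmeans would run in " + mode + " mode");
    }
}
```

Passing the environment in as a map keeps the rule easy to check without changing real environment variables.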


http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/fuzzy-k-means.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/fuzzy-k-means.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/fuzzy-k-means.md
new file mode 100644
index 0000000..ec53e62
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/fuzzy-k-means.md
@@ -0,0 +1,186 @@
+---
+layout: default
+title: Fuzzy K-Means
+theme:
+   name: retro-mahout
+---
+
+# Fuzzy K-Means
+
+Fuzzy K-Means (also called Fuzzy C-Means) is an extension of
+[K-Means](http://mahout.apache.org/users/clustering/k-means-clustering.html),
+the popular simple clustering technique. While K-Means discovers hard
+clusters (a point belongs to only one cluster), Fuzzy K-Means is a more
+statistically formalized method and discovers soft clusters, where a
+particular point can belong to more than one cluster with a certain
+probability.
+
+<a name="FuzzyK-Means-Algorithm"></a>
+#### Algorithm
+
+Like K-Means, Fuzzy K-Means works on objects that can be represented in an
+n-dimensional vector space and for which a distance measure is defined.
+The algorithm is similar to k-means.
+
+* Initialize k clusters
+* Until converged
+    * Compute the probability of a point belonging to a cluster for every <point,cluster> pair
+    * Recompute the cluster centers using the above probability membership values of points to clusters
+
+<a name="FuzzyK-Means-DesignImplementation"></a>
+#### Design Implementation
+
+The design is similar to the K-Means implementation present in Mahout. It
+accepts an input file containing vector points. The user can either provide
+the cluster centers as input or allow the canopy algorithm to run and create
+the initial clusters.
+
+As with K-Means, the program doesn't modify the input directories. For
+every iteration, the cluster output is stored in a directory cluster-N.
+The code sets the number of reduce tasks equal to the number of map tasks,
+so that many part-0 files are created in the clusterN directory. The code
+uses driver/mapper/combiner/reducer as follows:
+
+FuzzyKMeansDriver - This is similar to KMeansDriver. It iterates over
+input points and cluster points for the specified number of iterations or
+until it has converged. During every iteration i, a new cluster-i directory
+is created which contains the modified cluster centers obtained during the
+FuzzyKMeans iteration. These are fed as the input clusters to the next
+iteration. Once Fuzzy KMeans has run for the specified number of
+iterations or has converged, a map task is run to output "the point
+and the cluster membership to each cluster" pair as final output to a
+directory named "points".
+
+FuzzyKMeansMapper - reads the input clusters during its configure() method,
+then computes the cluster membership probability of a point to each
+cluster. Cluster membership is inversely proportional to the distance.
+Distance is computed using the user-supplied distance measure. Output key
+is the encoded clusterId. Output values are ClusterObservations containing
+observation statistics.
+
+FuzzyKMeansCombiner - receives all key:value pairs from the mapper and
+produces partial sums of the cluster membership probability times input
+vectors for each cluster. Output key is: encoded cluster identifier. Output
+values are ClusterObservations containing observation statistics.
+
+FuzzyKMeansReducer - Multiple reducers each receive certain keys and all
+values associated with those keys. The reducer sums the values to produce a
+new centroid for the cluster, which is output. Output key is: encoded cluster
+identifier (e.g. "C14"). Output value is: formatted cluster identifier (e.g.
+"C14"). The reducer encodes unconverged clusters with a 'Cn' clusterId and
+converged clusters with a 'Vn' clusterId.
+
+<a name="FuzzyK-Means-RunningFuzzyk-MeansClustering"></a>
+## Running Fuzzy k-Means Clustering
+
+The Fuzzy k-Means clustering algorithm may be run using a command-line
+invocation on FuzzyKMeansDriver.main or by making a Java call to
+FuzzyKMeansDriver.run(). 
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout fkmeans \
+        -i <input vectors directory> \
+        -c <input clusters directory> \
+        -o <output working directory> \
+        -dm <DistanceMeasure> \
+        -m <fuzziness argument >1> \
+        -x <maximum number of iterations> \
+        -k <optional number of initial clusters to sample from input vectors> \
+        -cd <optional convergence delta. Default is 0.5> \
+        -ow <overwrite output directory if present> \
+        -cl <run input vector clustering after computing Clusters> \
+        -e <emit vectors to most likely cluster during clustering> \
+        -t <threshold to use for clustering if -e is false> \
+        -xm <execution method: sequential or mapreduce>
+
+
+*Note:* if the -k argument is supplied, any clusters in the -c directory
+will be overwritten and -k random points will be sampled from the input
+vectors to become the initial cluster centers.
+
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set, a
+SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. clustersIn: a file path string to a directory containing the initial
+clusters, a SequenceFile(key, SoftCluster | Cluster | Canopy). Fuzzy
+k-Means SoftClusters, k-Means Clusters and Canopy Canopies may be used for
+the initial clusters.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. measure: the fully-qualified class name of an instance of DistanceMeasure
+which will be used for the clustering.
+1. convergence: a double value used to determine if the algorithm has
+converged (clusters have not moved more than the value in the last
+iteration)
+1. max-iterations: the maximum number of iterations to run, independent of
+the convergence specified
+1. m: the "fuzziness" argument, a double > 1. For m equal to 2, this is
+equivalent to normalising the coefficients linearly to make their sum 1.
+When m is close to 1, then the cluster center closest to the point is given
+much more weight than the others, and the algorithm is similar to k-means.
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. emitMostLikely: a boolean indicating, if true, that the clustering step
+should only emit the most likely cluster for each clustered point.
+1. threshold: a double indicating, if emitMostLikely is false, the cluster
+probability threshold used for emitting multiple clusters for each point. A
+value of 0 will emit all clusters with their associated probabilities for
+each vector.
+1. runSequential: a boolean indicating, if true, that the algorithm is to
+use the sequential reference implementation running in memory.
+
+After running the algorithm, the output directory will contain:
+1. clusters-N: directories containing SequenceFiles(Text, SoftCluster)
+produced by the algorithm for each iteration. The Text _key_ is a cluster
+identifier string.
+1. clusteredPoints: (if runClustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the clusterId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weights are
+computed as 1/(1+distance) where the distance is between the cluster center
+and the vector using the chosen DistanceMeasure. 
+
+<a name="FuzzyK-Means-Examples"></a>
+# Examples
+
+The following images illustrate Fuzzy k-Means clustering applied to a set
+of randomly-generated 2-d data points. The points are generated using a
+normal distribution centered at a mean location and with a constant
+standard deviation. See the README file in
+[/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=\[1.0, 1.0\] sd=3.0
+* 300 samples m=\[1.0, 0.0\] sd=0.5
+* 300 samples m=\[0.0, 2.0\] sd=0.1
+
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed. 
+
+![fuzzy](../../images/SampleData.png)
+
+In the second image, the resulting clusters (k=3) are shown superimposed upon
+the sample data. As Fuzzy k-Means is an iterative algorithm, the centers of the
+clusters in each recent iteration are shown using different colors. Bold red is
+the final clustering and previous iterations are shown in orange, yellow,
+green, blue, violet and gray. Although it misses a lot of the points and cannot
+capture the original, superimposed cluster centers, it does a decent job of
+clustering this data.
+
+![fuzzy](../../images/FuzzyKMeans.png)
+
+The third image shows the results of running Fuzzy k-Means on a different
+data set which is generated using asymmetrical standard deviations.
+Fuzzy k-Means does a fair job handling this data set as well.
+
+![fuzzy](../../images/2dFuzzyKMeans.png)
+
+<a name="FuzzyK-Means-References&nbsp;"></a>
+#### References
+
+* [http://en.wikipedia.org/wiki/Fuzzy_clustering](http://en.wikipedia.org/wiki/Fuzzy_clustering)
\ No newline at end of file
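
The two steps of the algorithm outlined in fuzzy-k-means.md above (compute a membership for every point and cluster pair, then recompute the centers from those memberships) can be sketched in plain Java. This uses the textbook fuzzy c-means formulas and is not taken from Mahout's source. The class name, method names and the `MIN_DISTANCE` guard are assumptions made for illustration, and `distances` holds whatever values the chosen DistanceMeasure would return.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and not part of the Mahout API. */
final class FuzzyMembershipSketch {

    /** Assumed guard so a point lying exactly on a center does not divide by zero. */
    private static final double MIN_DISTANCE = 1e-10;

    /**
     * Textbook membership of one point in each cluster:
     * u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)), with d_j the distance to center j.
     */
    static double[] memberships(double[] distances, double m) {
        if (m <= 1.0) {
            throw new IllegalArgumentException("m must be greater than 1");
        }
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int j = 0; j < distances.length; j++) {
            double dj = Math.max(distances[j], MIN_DISTANCE);
            double denominator = 0.0;
            for (double d : distances) {
                denominator += Math.pow(dj / Math.max(d, MIN_DISTANCE), exponent);
            }
            u[j] = 1.0 / denominator;
        }
        return u;
    }

    /** New center of one cluster: the mean of all points, each weighted by u^m. */
    static double[] weightedCenter(double[][] points, double[] u, double m) {
        double[] center = new double[points[0].length];
        double totalWeight = 0.0;
        for (int i = 0; i < points.length; i++) {
            double w = Math.pow(u[i], m);
            totalWeight += w;
            for (int d = 0; d < center.length; d++) {
                center[d] += w * points[i][d];
            }
        }
        for (int d = 0; d < center.length; d++) {
            center[d] /= totalWeight;
        }
        return center;
    }

    public static void main(String[] args) {
        double[] distances = {1.0, 2.0}; // one point, two cluster centers
        System.out.println(Arrays.toString(memberships(distances, 2.0)));
        System.out.println(Arrays.toString(memberships(distances, 1.1)));
    }
}
```

The second call in `main` uses m close to 1, where nearly all of the membership goes to the closer center. That matches the page's remark that the algorithm then behaves like k-means.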

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/hierarchical-clustering.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/hierarchical-clustering.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/hierarchical-clustering.md
new file mode 100644
index 0000000..6c541cc
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/hierarchical-clustering.md
@@ -0,0 +1,15 @@
+---
+layout: default
+title: Hierarchical Clustering
+theme:
+   name: retro-mahout
+---
+Hierarchical clustering is the process of finding bigger clusters, and also
+the smaller clusters inside the bigger clusters.
+
+In Apache Mahout, separate algorithms can be used for finding clusters at
+different levels. 
+
+See [Top Down Clustering](https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering).
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/k-means-clustering.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/k-means-clustering.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/k-means-clustering.md
new file mode 100644
index 0000000..5c25763
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/k-means-clustering.md
@@ -0,0 +1,182 @@
+---
+layout: default
+title: K-Means Clustering
+theme:
+   name: retro-mahout
+---
+
+# k-Means clustering - basics
+
+[k-Means](http://en.wikipedia.org/wiki/Kmeans) is a simple but well-known
+algorithm for grouping objects (clustering). All objects need to be represented
+as a set of numerical features. In addition, the user has to specify the
+number of groups (referred to as *k*) she wishes to identify.
+
+Each object can be thought of as being represented by some feature vector
+in an _n_ dimensional space, _n_ being the number of all features used to
+describe the objects to cluster. The algorithm then randomly chooses _k_
+points in that vector space; these points serve as the initial centers of
+the clusters. Afterwards, each object is assigned to the center it
+is closest to. Usually the distance measure is chosen by the user and
+determined by the learning task.
+
+After that, for each cluster a new center is computed by averaging the
+feature vectors of all objects assigned to it. The process of assigning
+objects and recomputing centers is repeated until the process converges.
+The algorithm can be proven to converge after a finite number of
+iterations.
+
+Several tweaks concerning distance measure, initial center choice and
+computation of new average centers have been explored, as well as the
+estimation of the number of clusters _k_. Yet the main principle always
+remains the same.
+
+
+
+<a name="K-MeansClustering-Quickstart"></a>
+## Quickstart
+
+[Here](https://github.com/apache/mahout/blob/master/examples/bin/cluster-reuters.sh)
+ is a short shell script outline that will get you started quickly with
+k-means. This does the following:
+
+* Accepts clustering type: *kmeans*, *fuzzykmeans*, *lda*, or *streamingkmeans*
+* Gets the Reuters dataset
+* Runs org.apache.lucene.benchmark.utils.ExtractReuters to generate
+reuters-out from reuters-sgm (the downloaded archive)
+* Runs seqdirectory to convert reuters-out to SequenceFile format
+* Runs seq2sparse to convert SequenceFiles to sparse vector format
+* Runs k-means with 20 clusters
+* Runs clusterdump to show results
+
+After following through the output that scrolls past, reading the code will
+offer you a better understanding.
+
+
+<a name="K-MeansClustering-Designofimplementation"></a>
+## Implementation
+
+The implementation accepts two input directories: one for the data points
+and one for the initial clusters. The data directory contains multiple
+input files of SequenceFile(Key, VectorWritable), while the clusters
+directory contains one or more SequenceFiles(Text, Cluster)
+containing _k_ initial clusters or canopies. None of the input directories
+are modified by the implementation, allowing experimentation with initial
+clustering and convergence values.
+
+Canopy clustering can be used to compute the initial clusters for k-Means:
+
+    // run the CanopyDriver job
+    CanopyDriver.runJob("testdata", "output",
+        ManhattanDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1, false);
+
+    // now run the KMeansDriver job
+    KMeansDriver.runJob("testdata", "output/clusters-0", "output",
+        EuclideanDistanceMeasure.class.getName(), "0.001", "10", true);
+
+
+In the above example, the input data points are stored in 'testdata' and
+the CanopyDriver is configured to output to the 'output/clusters-0'
+directory. Once the driver executes, it will contain the canopy definition
+files. Upon running the KMeansDriver, the output directory will have two or
+more new directories: 'clusters-N' containing the clusters for each
+iteration, and 'clusteredPoints' containing the clustered data points.
+
+This diagram shows the exemplary dataflow of the k-Means example
+implementation provided by Mahout:
+<img src="../../images/Example implementation of k-Means provided with Mahout.png">
+
+
+<a name="K-MeansClustering-Runningk-MeansClustering"></a>
+## Running k-Means Clustering
+
+The k-Means clustering algorithm may be run using a command-line invocation
+on KMeansDriver.main or by making a Java call to KMeansDriver.runJob().
+
+Invocation using the command line takes the form:
+
+
+    bin/mahout kmeans \
+        -i <input vectors directory> \
+        -c <input clusters directory> \
+        -o <output working directory> \
+        -k <optional number of initial clusters to sample from input vectors> \
+        -dm <DistanceMeasure> \
+        -x <maximum number of iterations> \
+        -cd <optional convergence delta. Default is 0.5> \
+        -ow <overwrite output directory if present> \
+        -cl <run input vector clustering after computing Clusters> \
+        -xm <execution method: sequential or mapreduce>
+
+
+Note: if the \-k argument is supplied, any clusters in the \-c directory
+will be overwritten and \-k random points will be sampled from the input
+vectors to become the initial cluster centers.
+
+Invocation using Java involves supplying the following arguments:
+
+1. input: a file path string to a directory containing the input data set, a
+SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
+is not used.
+1. clusters: a file path string to a directory containing the initial
+clusters, a SequenceFile(key, Cluster \| Canopy). Both KMeans clusters and
+Canopy canopies may be used for the initial clusters.
+1. output: a file path string to an empty directory which is used for all
+output from the algorithm.
+1. distanceMeasure: the fully-qualified class name of an instance of
+DistanceMeasure which will be used for the clustering.
+1. convergenceDelta: a double value used to determine if the algorithm has
+converged (clusters have not moved more than the value in the last
+iteration)
+1. maxIter: the maximum number of iterations to run, independent of the
+convergence specified
+1. runClustering: a boolean indicating, if true, that the clustering step is
+to be executed after clusters have been determined.
+1. runSequential: a boolean indicating, if true, that the k-means sequential
+implementation is to be used to process the input data.
+
+After running the algorithm, the output directory will contain:
+1. clusters-N: directories containing SequenceFiles(Text, Cluster) produced
+by the algorithm for each iteration. The Text _key_ is a cluster identifier
+string.
+1. clusteredPoints: (if \--clustering enabled) a directory containing
+SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
+the clusterId. The WeightedVectorWritable _value_ is a bean containing a
+double _weight_ and a VectorWritable _vector_ where the weight indicates
+the probability that the vector is a member of the cluster. For k-Means
+clustering, the weights are computed as 1/(1+distance) where the distance
+is between the cluster center and the vector using the chosen
+DistanceMeasure.
+
+<a name="K-MeansClustering-Examples"></a>
+# Examples
+
+The following images illustrate k-Means clustering applied to a set of
+randomly-generated 2-d data points. The points are generated using a normal
+distribution centered at a mean location and with a constant standard
+deviation. See the README file in
+[/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
+for details on running similar examples.
+
+The points are generated as follows:
+
+* 500 samples m=\[1.0, 1.0\] sd=3.0
+* 300 samples m=\[1.0, 0.0\] sd=0.5
+* 300 samples m=\[0.0, 2.0\] sd=0.1
+
+In the first image, the points are plotted and the 3-sigma boundaries of
+their generator are superimposed.
+
+![Sample data graph](../../images/SampleData.png)
+
+In the second image, the resulting clusters (k=3) are shown superimposed upon
+the sample data. As k-Means is an iterative algorithm, the centers of the
+clusters in each recent iteration are shown using different colors. Bold red is
+the final clustering and previous iterations are shown in orange, yellow,
+green, blue, violet and gray. Although it misses a lot of the points and cannot
+capture the original, superimposed cluster centers, it does a decent job of
+clustering this data.
+
+![kmeans](../../images/KMeans.png)
+
+The third image shows the results of running k-Means on a different dataset,
+which is generated using asymmetrical standard deviations.
+K-Means does a fair job handling this data set as well.
+
+![2d kmeans](../../images/2dKMeans.png)
\ No newline at end of file
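
The assign-then-average loop described in the basics section of k-means-clustering.md above, together with the convergence delta and the 1/(1+distance) weight mentioned for clusteredPoints, can be sketched in plain Java. This is a single-machine illustration under stated assumptions and not Mahout's MapReduce implementation. All names are hypothetical, Euclidean distance is assumed, and leaving an empty cluster's center unchanged is an arbitrary choice made for the sketch.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and not part of the Mahout API. */
final class KMeansStepSketch {

    /** Euclidean distance, standing in for the user-chosen DistanceMeasure. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    }

    /** Index of the center closest to the point. */
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(point, centers[c]) < distance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** One iteration: assign each point to its nearest center, then average per cluster. */
    static double[][] step(double[][] points, double[][] centers) {
        int dim = points[0].length;
        double[][] sums = new double[centers.length][dim];
        int[] counts = new int[centers.length];
        for (double[] p : points) {
            int c = nearest(p, centers);
            counts[c]++;
            for (int d = 0; d < dim; d++) {
                sums[c][d] += p[d];
            }
        }
        double[][] next = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            if (counts[c] == 0) {
                next[c] = centers[c].clone(); // assumption: an empty cluster keeps its center
                continue;
            }
            next[c] = new double[dim];
            for (int d = 0; d < dim; d++) {
                next[c][d] = sums[c][d] / counts[c];
            }
        }
        return next;
    }

    /** True when no center moved by more than delta between two iterations. */
    static boolean converged(double[][] before, double[][] after, double delta) {
        for (int c = 0; c < before.length; c++) {
            if (distance(before[c], after[c]) > delta) {
                return false;
            }
        }
        return true;
    }

    /** Weight formula quoted in the text for clusteredPoints. */
    static double weight(double dist) {
        return 1.0 / (1.0 + dist);
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = {{0, 0}, {10, 10}};
        for (int iter = 0; iter < 10; iter++) { // maxIter = 10, convergenceDelta = 0.001
            double[][] next = step(points, centers);
            boolean done = converged(centers, next, 0.001);
            centers = next;
            if (done) {
                break;
            }
        }
        System.out.println(Arrays.deepToString(centers));
    }
}
```

The loop in `main` mirrors the roles of the maxIter and convergenceDelta arguments: it stops at whichever limit is reached first.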

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/k-means-commandline.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/k-means-commandline.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/k-means-commandline.md
new file mode 100644
index 0000000..8d802f8
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/k-means-commandline.md
@@ -0,0 +1,94 @@
+---
+layout: default
+title: k-means-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="k-means-commandline-Introduction"></a>
+# kMeans commandline introduction
+
+This quick start page describes how to run the kMeans clustering algorithm
+on a Hadoop cluster. 
+
+<a name="k-means-commandline-Steps"></a>
+# Steps
+
+Mahout's k-Means clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine, then the invocation will
+run k-Means on that cluster. If either of the environment variables is
+missing, then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout kmeans <OPTIONS>
+
+
+In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release,
+the job will be mahout-core-0.3.job.
+
+
+<a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on a single machine without a cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout kmeans -i testdata -o output -c clusters \
+        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
+        -x 5 -ow -cd 1 -k 25
+
+
+<a name="k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout kmeans -i testdata -o output -c clusters \
+        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
+        -x 5 -ow -cd 1 -k 25
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                      Path to job input directory.
+                                              Must be a SequenceFile of
+                                              VectorWritable
+      --clusters (-c) clusters                The input centroids, as Vectors.
+                                              Must be a SequenceFile of
+                                              Writable, Cluster/Canopy. If k
+                                              is also specified, then a random
+                                              set of vectors will be selected
+                                              and written out to this path
+                                              first
+      --output (-o) output                    The directory pathname for
+                                              output.
+      --distanceMeasure (-dm) distanceMeasure The classname of the
+                                              DistanceMeasure. Default is
+                                              SquaredEuclidean
+      --convergenceDelta (-cd) convergenceDelta
+                                              The convergence delta value.
+                                              Default is 0.5
+      --maxIter (-x) maxIter                  The maximum number of
+                                              iterations.
+      --maxRed (-r) maxRed                    The number of reduce tasks.
+                                              Defaults to 2
+      --k (-k) k                              The k in k-Means.  If specified,
+                                              then a random selection of k
+                                              Vectors will be chosen as the
+                                              Centroid and written to the
+                                              clusters input path.
+      --overwrite (-ow)                       If present, overwrite the output
+                                              directory before running job
+      --help (-h)                             Print out help
+      --clustering (-cl)                      If present, run clustering after
+                                              the iterations have taken place
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/latent-dirichlet-allocation.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/latent-dirichlet-allocation.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/latent-dirichlet-allocation.md
new file mode 100644
index 0000000..871cea2
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/latent-dirichlet-allocation.md
@@ -0,0 +1,155 @@
+---
+layout: default
+title: Latent Dirichlet Allocation
+theme:
+   name: retro-mahout
+---
+
+<a name="LatentDirichletAllocation-Overview"></a>
+# Overview
+
+Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
+algorithm for automatically and jointly clustering words into "topics" and
+documents into mixtures of topics. It has been successfully applied to
+model change in scientific fields over time (Griffiths and Steyvers, 2004;
+Hall, et al. 2008). 
+
+A topic model is, roughly, a hierarchical Bayesian model that associates
+with each document a probability distribution over "topics", which are in
+turn distributions over words. For instance, a topic in a collection of
+newswire might include words about "sports", such as "baseball", "home
+run", "player", and a document about steroid use in baseball might include
+"sports", "drugs", and "politics". Note that the labels "sports", "drugs",
+and "politics", are post-hoc labels assigned by a human, and that the
+algorithm itself only associates words with probabilities. The task
+of parameter estimation in these models is to learn both what the topics
+are, and which documents employ them in what proportions.
+
+Another way to view a topic model is as a generalization of a mixture model
+like [Dirichlet Process Clustering](http://en.wikipedia.org/wiki/Dirichlet_process).
+Starting from a normal mixture model, in which we have a single global
+mixture of several distributions, we instead say that _each_ document has
+its own mixture distribution over the globally shared mixture components.
+Operationally in Dirichlet Process Clustering, each document has its own
+latent variable drawn from a global mixture that specifies which model it
+belongs to, while in LDA each word in each document has its own parameter
+drawn from a document-wide mixture.
+
+The idea is that we use a probabilistic mixture of a number of models that
+we use to explain some observed data. Each observed data point is assumed
+to have come from one of the models in the mixture, but we don't know
+which. The way we deal with that is to use a so-called latent parameter
+which specifies which model each data point came from.
+
+<a name="LatentDirichletAllocation-CollapsedVariationalBayes"></a>
+# Collapsed Variational Bayes
+The CVB algorithm, which is implemented in Mahout for LDA, combines
+advantages of both regular Variational Bayes and Gibbs Sampling.  The
+algorithm relies on modeling dependence of parameters on latent variables
+which are in turn mutually independent.  The algorithm uses two
+methodologies: one is to marginalize out parameters when calculating the
+joint distribution, and the other is to model the posterior of theta and
+phi given the inputs z and x.
+
+A common solution to the CVB algorithm is to compute each expectation term
+by using a simple Gaussian approximation, which is accurate and requires low
+computational overhead.  The specifics behind the approximation involve
+computing the sum of the means and variances of the individual Bernoulli
+variables.
+
+CVB with Gaussian approximation is implemented by tracking the mean and
+variance and subtracting the mean and variance of the corresponding
+Bernoulli variables.  The computational cost for the algorithm scales on
+the order of O(K) with each update to q(z(i,j)).  Also for each
+document/word pair only 1 copy of the variational posterior is required
+over the latent variable.
+
+<a name="LatentDirichletAllocation-InvocationandUsage"></a>
+# Invocation and Usage
+
+Mahout's implementation of LDA operates on a collection of SparseVectors of
+word counts. These word counts should be non-negative integers, though
+things will probably work fine if you use non-negative reals. (Note
+that the probabilistic model doesn't make sense if you do!) To create these
+vectors, it's recommended that you follow the instructions in
+[Creating Vectors From Text](../basics/creating-vectors-from-text.html),
+making sure to use TF and not TFIDF as the scorer.
+
+Invocation takes the form:
+
+
+    bin/mahout cvb \
+        -i <input path for document vectors> \
+        -dict <path to term-dictionary file(s) , glob expression supported> \
+        -o <output path for topic-term distributions> \
+        -dt <output path for doc-topic distributions> \
+        -k <number of latent topics> \
+        -nt <number of unique features defined by input document vectors> \
+        -mt <path to store model state after each iteration> \
+        -maxIter <max number of iterations> \
+        -mipd <max number of iterations per doc for learning> \
+        -a <smoothing for doc topic distributions> \
+        -e <smoothing for term topic distributions> \
+        -seed <random seed> \
+        -tf <fraction of data to hold for testing> \
+        -block <number of iterations per perplexity check, ignored unless test_set_percentage>0>
+
+
+Topic smoothing should generally be about 50/K, where K is the number of
+topics. The number of words in the vocabulary can be an upper bound, though
+it shouldn't be too high (for memory concerns). 
+
+Choosing the number of topics is more art than science, and it's
+recommended that you try several values.
+
+After running LDA you can obtain an output of the computed topics using the
+LDAPrintTopics utility:
+
+
+    bin/mahout ldatopics \
+        -i <input vectors directory> \
+        -d <input dictionary file> \
+        -w <optional number of words to print> \
+        -o <optional output working directory. Default is to console> \
+        -h <print out help> \
+        -dt <optional dictionary type (text|sequencefile). Default is text>
+
+
+
+<a name="LatentDirichletAllocation-Example"></a>
+# Example
+
+An example is located in mahout/examples/bin/build-reuters.sh. The script
+automatically downloads the Reuters-21578 corpus, builds a Lucene index and
+converts the Lucene index to vectors. By uncommenting the last two lines
+you can then cause it to run LDA on the vectors and finally print the
+resultant topics to the console. 
+
+To adapt the example yourself, you should note that Lucene has specialized
+support for Reuters, and that building your own index will require some
+adaptation. The rest should hopefully not differ too much.
+
+<a name="LatentDirichletAllocation-ParameterEstimation"></a>
+# Parameter Estimation
+
+We use mean field variational inference to estimate the models. Variational
+inference can be thought of as a generalization of
+[EM](expectation-maximization.html) for hierarchical Bayesian models. In the
+E-Step we infer, for each document, the posterior probability of each topic
+for each word in that document. We then take the sufficient statistics and emit them in
+the form of (log) pseudo-counts for each word in each topic. The M-Step is
+simply to sum these together and (log) normalize them so that we have a
+distribution over the entire vocabulary of the corpus for each topic. 
+
+In implementation, the E-Step is implemented in the Map, and the M-Step is
+executed in the reduce step, with the final normalization happening as a
+post-processing step.
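
The reduce-side work described above can be sketched in a few lines. This is only an illustration, not Mahout's code: it runs in a single process, works with plain (not log) pseudo-counts, and the function name `m_step` and the data layout are assumptions made for the example.

```python
from collections import defaultdict

def m_step(mapper_outputs):
    """Sum per-document pseudo-counts and normalize them per topic.

    mapper_outputs: iterable of dicts mapping (topic, word) -> pseudo-count,
    one dict per E-step (map) output. This layout is hypothetical.
    """
    # "Reduce": add up the pseudo-counts emitted for each (topic, word) pair.
    totals = defaultdict(float)
    for pseudo_counts in mapper_outputs:
        for (topic, word), count in pseudo_counts.items():
            totals[(topic, word)] += count

    # Normalization: each topic becomes a distribution over the vocabulary.
    topic_mass = defaultdict(float)
    for (topic, _word), count in totals.items():
        topic_mass[topic] += count
    return {key: count / topic_mass[key[0]] for key, count in totals.items()}

mapper_outputs = [
    {(0, "ball"): 1.5, (0, "vote"): 0.5},
    {(0, "ball"): 0.5, (1, "vote"): 2.0},
]
topic_word = m_step(mapper_outputs)
```

Here topic 0 ends up with "ball" at 0.8 and "vote" at 0.2, and topic 1 puts all of its mass on "vote".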
+
+<a name="LatentDirichletAllocation-References"></a>
+# References
+
+[David M. Blei, Andrew Y. Ng, Michael I. Jordan, John Lafferty. 2003. Latent Dirichlet Allocation. JMLR.](http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf)
+
+[Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS.](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf)
+
+[David Hall, Dan Jurafsky, and Christopher D. Manning. 2008. Studying the History of Ideas Using Topic Models.](http://aclweb.org/anthology//D/D08/D08-1038.pdf)

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/lda-commandline.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/lda-commandline.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/lda-commandline.md
new file mode 100644
index 0000000..613e90b
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/lda-commandline.md
@@ -0,0 +1,83 @@
+---
+layout: default
+title: lda-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a>
+# Running Latent Dirichlet Allocation (algorithm) from the Command Line
+[Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897),
+LDA has been implemented as Collapsed Variational Bayes (cvb).
+
+Mahout's LDA can be launched from the same command line invocation whether
+you are running on a single machine in stand-alone mode or on a larger
+Hadoop cluster. The difference is determined by the $HADOOP_HOME and
+$HADOOP_CONF_DIR environment variables. If both are set to an operating
+Hadoop cluster on the target machine then the invocation will run the LDA
+algorithm on that cluster. If either of the environment variables is
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+
+    ./bin/mahout cvb <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job
+
+
+<a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+
+<a name="lda-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a>
+# Command line options from Mahout cvb version 0.8
+
+    mahout cvb -h 
+      --input (-i) input                                    Path to job input directory.
+      --output (-o) output                                  The directory pathname for output.
+      --maxIter (-x) maxIter                                The maximum number of iterations.
+      --convergenceDelta (-cd) convergenceDelta             The convergence delta value
+      --overwrite (-ow)                                     If present, overwrite the output directory before running job
+      --num_topics (-k) num_topics                          Number of topics to learn
+      --num_terms (-nt) num_terms                           Vocabulary size
+      --doc_topic_smoothing (-a) doc_topic_smoothing        Smoothing for document/topic distribution
+      --term_topic_smoothing (-e) term_topic_smoothing      Smoothing for topic/term distribution
+      --dictionary (-dict) dictionary                       Path to term-dictionary file(s) (glob expression supported)
+      --doc_topic_output (-dt) doc_topic_output             Output path for the training doc/topic distribution
+      --topic_model_temp_dir (-mt) topic_model_temp_dir     Path to intermediate model path (useful for restarting)
+      --iteration_block_size (-block) iteration_block_size  Number of iterations per perplexity check
+      --random_seed (-seed) random_seed                     Random seed
+      --test_set_fraction (-tf) test_set_fraction           Fraction of data to hold out for testing
+      --num_train_threads (-ntt) num_train_threads          number of threads per mapper to train with
+      --num_update_threads (-nut) num_update_threads        number of threads per mapper to update the model with
+      --max_doc_topic_iters (-mipd) max_doc_topic_iters     max number of iterations per doc for p(topic|doc) learning
+      --num_reduce_tasks num_reduce_tasks                   number of reducers to use during model estimation
+      --backfill_perplexity                                 enable backfilling of missing perplexity values
+      --help (-h)                                           Print out help
+      --tempDir tempDir                                     Intermediate output directory
+      --startPhase startPhase                               First phase to run
+      --endPhase endPhase                                   Last phase to run
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/llr---log-likelihood-ratio.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/llr---log-likelihood-ratio.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/llr---log-likelihood-ratio.md
new file mode 100644
index 0000000..d6b7e18
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/llr---log-likelihood-ratio.md
@@ -0,0 +1,46 @@
+---
+layout: default
+title: LLR - Log-likelihood Ratio
+theme:
+   name: retro-mahout
+---
+
+# Likelihood ratio test
+
+_The likelihood ratio test is used to compare the fit of two models, one
+of which is nested within the other._
+
+In the context of machine learning and the Mahout project in particular,
+the term LLR is usually meant to refer to a test of significance for two
+binomial distributions, also known as the G squared statistic. This is a
+special case of the multinomial test and is closely related to mutual
+information.  The value of this statistic is not normally used in this
+context as a true frequentist test of significance since there would be
+obvious and dreadful problems to do with multiple comparisons, but rather
+as a heuristic score to order pairs of items with the most interestingly
+connected items having higher scores.  In this usage, the LLR has proven
+very useful for discriminating pairs of features that have interesting
+degrees of cooccurrence and those that do not with usefully small false
+positive and false negative rates.  The LLR is typically far more suitable
+in the case of small counts than many other measures such as Pearson's
+correlation, Pearson's chi squared statistic or z statistics.  The LLR as
+stated does not, however, make any use of rating data which can limit its
+applicability in problems such as the Netflix competition. 
+
+The actual value of the LLR is not usually very helpful other than as a way
+of ordering pairs of items.  As such, it is often used to determine a
+sparse set of coefficients to be estimated by other means such as TF-IDF. 
+Since the actual estimation of these coefficients can be done in a way that
+is independent of the training data such as by general corpus statistics,
+and since the ordering imposed by the LLR is relatively robust to counting
+fluctuation, this technique can provide very strong results in very sparse
+problems where the potential number of features vastly out-numbers the
+number of training examples and where features are highly interdependent.
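
For concreteness, the G squared statistic for a 2x2 table of co-occurrence counts can be computed as sketched below. This is a minimal illustration of the standard formula, not a copy of Mahout's implementation; the function names and the cell naming (`k11` = both items together, `k12` and `k21` = one item without the other, `k22` = neither) are assumptions chosen for the example.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 = 2 * sum(O * ln(O / E)) for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    cells = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
    rows = x_log_x(k11 + k12) + x_log_x(k21 + k22)
    cols = x_log_x(k11 + k21) + x_log_x(k12 + k22)
    # Expanding O * ln(O / E) with E = row * col / total gives this form.
    llr = 2.0 * (cells - rows - cols + x_log_x(total))
    return max(llr, 0.0)  # guard against tiny negative rounding errors

independent = log_likelihood_ratio(10, 10, 10, 10)
associated = log_likelihood_ratio(10, 0, 0, 10)
```

A table whose rows and columns are independent scores 0, while a perfectly associated table such as `(10, 0, 0, 10)` scores `40 * ln(2)`, roughly 27.7. Only the ordering of such scores across pairs is normally used.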
+
+ See Also: 
+
+* [Blog post "surprise and coincidence"](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
+* [G-Test](http://en.wikipedia.org/wiki/G-test)
+* [Likelihood Ratio Test](http://en.wikipedia.org/wiki/Likelihood-ratio_test)
+
+      
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/spectral-clustering.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/spectral-clustering.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/spectral-clustering.md
new file mode 100644
index 0000000..d0f5199
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/spectral-clustering.md
@@ -0,0 +1,84 @@
+---
+layout: default
+title: Spectral Clustering
+theme:
+   name: retro-mahout
+---
+
+# Spectral Clustering Overview
+
+Spectral clustering, as its name implies, makes use of the spectrum (or 
eigenvalues) of the similarity matrix of the data. It examines the 
_connectedness_ of the data, whereas other clustering algorithms such as 
k-means use the _compactness_ to assign clusters. Consequently, in situations 
where k-means performs well, spectral clustering will also perform well. 
Additionally, there are situations in which k-means will underperform (e.g. 
concentric circles), but spectral clustering will be able to segment the 
underlying clusters. Spectral clustering is also very useful for image 
segmentation.
+
+At its simplest, spectral clustering relies on the following four steps:
+
+ 1. Computing a similarity (or _affinity_) matrix `\(\mathbf{A}\)` from the 
data. This involves determining a pairwise distance function `\(f\)` that takes 
a pair of data points and returns a scalar.
+
+ 2. Computing a graph Laplacian `\(\mathbf{L}\)` from the affinity matrix. 
There are several types of graph Laplacians; which one is used often depends 
on the situation.
+
+ 3. Computing the eigenvectors and eigenvalues of `\(\mathbf{L}\)`. The degree 
of this decomposition is often modulated by `\(k\)`, or the number of clusters. 
Put another way, `\(k\)` eigenvectors and eigenvalues are computed.
+
+ 4. The `\(k\)` eigenvectors are used as "proxy" data for the original 
dataset, and fed into k-means clustering. The resulting cluster assignments are 
transparently passed back to the original data.
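
The first two steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than Mahout's implementation: the RBF kernel with a hypothetical `sigma` parameter is used as the affinity function, and the symmetric normalization `D^{-1/2} A D^{-1/2}` is assumed as the graph Laplacian variant. Steps 3 and 4 would then hand the resulting matrix to an eigensolver and to k-means.

```python
import math

def rbf_affinity(points, sigma=1.0):
    """Step 1: pairwise RBF affinities; the diagonal is left at 0."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = A[j][i] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return A

def normalized_laplacian(A):
    """Step 2 (assumed variant): D^{-1/2} A D^{-1/2}, D = diagonal of row sums."""
    n = len(A)
    degrees = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

# Two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
A = rbf_affinity(points)
L = normalized_laplacian(A)
```

Points within a pair get an affinity close to 1 and points in different pairs get an affinity close to 0, which is the "connectedness" structure the later eigen-decomposition picks up.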
+
+For more theoretical background on spectral clustering, such as how affinity 
matrices are computed, the different types of graph Laplacians, and whether the 
top or bottom eigenvectors and eigenvalues are computed, please read [Ulrike 
von Luxburg's article in _Statistics and Computing_ from December 
2007](http://link.springer.com/article/10.1007/s11222-007-9033-z). It provides 
an excellent description of the linear algebra operations behind spectral 
clustering, and gives a thorough understanding of the types of situations in 
which it can be used.
+
+# Mahout Spectral Clustering
+
+As of Mahout 0.3, spectral clustering has been implemented to take advantage 
of the MapReduce framework. It uses 
[SSVD](http://mahout.apache.org/users/dim-reduction/ssvd.html) for 
dimensionality reduction of the input data set, and 
[k-means](http://mahout.apache.org/users/clustering/k-means-clustering.html) to 
perform the final clustering.
+
+**([MAHOUT-1538](https://issues.apache.org/jira/browse/MAHOUT-1538) will port 
the existing Hadoop MapReduce implementation to Mahout DSL, allowing for one of 
several distinct distributed back-ends to conduct the computation)**
+
+## Input
+
+The input format for the algorithm currently takes the form of a Hadoop-backed 
affinity matrix in the form of text files. Each line of the text file specifies 
a single element of the affinity matrix: the row index `\(i\)`, the column 
index `\(j\)`, and the value:
+
+`i, j, value`
+
+The affinity matrix is symmetric, and any unspecified `\(i, j\)` pairs are 
assumed to be 0 for sparsity. The row and column indices are 0-indexed. Thus, 
only the non-zero entries of either the upper or lower triangular need be 
specified.
+
+The matrix elements specified in the text files are collected into a Mahout 
`DistributedRowMatrix`.
+
+**([MAHOUT-1539](https://issues.apache.org/jira/browse/MAHOUT-1539) will allow 
for the creation of the affinity matrix to occur as part of the core spectral 
clustering algorithm, as opposed to the current requirement that the user 
create this matrix themselves and provide it, rather than the original data, to 
the algorithm)**
+
+## Running spectral clustering
+
+**([MAHOUT-1540](https://issues.apache.org/jira/browse/MAHOUT-1540) will 
provide a running example of this algorithm and this section will be updated to 
show how to run the example and what the expected output should be; until then, 
this section provides a how-to for simply running the algorithm on arbitrary 
input)**
+
+Spectral clustering can be invoked with the following arguments.
+
+    bin/mahout spectralkmeans \
+        -i <affinity matrix directory> \
+        -o <output working directory> \
+        -d <number of data points> \
+        -k <number of clusters AND number of top eigenvectors to use> \
+        -x <maximum number of k-means iterations>
+
+The affinity matrix can be contained in a single text file (using the 
aforementioned one-line-per-entry format) or span many text files (per 
[MAHOUT-978](https://issues.apache.org/jira/browse/MAHOUT-978), do not prefix 
text files with a leading underscore '_' or period '.'). The `-d` flag is 
required for the algorithm to know the dimensions of the affinity matrix. `-k` 
is the number of top eigenvectors from the normalized graph Laplacian in the 
SSVD step, and also the number of clusters given to k-means after the SSVD step.
+
+## Example
+
+To provide a simple example, take the following affinity matrix, contained in 
a text file called `affinity.txt`:
+
+    0, 0, 0
+    0, 1, 0.8
+    0, 2, 0.5
+    1, 0, 0.8
+    1, 1, 0
+    1, 2, 0.9
+    2, 0, 0.5
+    2, 1, 0.9
+    2, 2, 0
+
+With this 3-by-3 matrix, `-d` would be `3`. Furthermore, since all affinity 
matrices are assumed to be symmetric, the entries specifying both `1, 2, 0.9` 
and `2, 1, 0.9` are redundant; only one of these is needed. Additionally, any 
entries that are 0, such as those along the diagonal, also need not be 
specified at all. They are provided here for completeness.
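
To make the sparsity and symmetry conventions concrete, the sketch below parses only the three upper-triangular, non-zero lines of the example into a full matrix. It is an illustration of the file format as described here, not the reader Mahout uses; `load_affinity` is a hypothetical helper.

```python
def load_affinity(lines, num_points):
    """Build a dense symmetric matrix from 'i, j, value' lines (0-indexed)."""
    matrix = [[0.0] * num_points for _ in range(num_points)]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        i_str, j_str, value_str = line.split(",")
        i, j, value = int(i_str), int(j_str), float(value_str)
        # Unspecified entries stay 0; each entry is mirrored across the diagonal.
        matrix[i][j] = value
        matrix[j][i] = value
    return matrix

upper_triangle = ["0, 1, 0.8", "0, 2, 0.5", "1, 2, 0.9"]
matrix = load_affinity(upper_triangle, num_points=3)
```

The result is the same 3-by-3 matrix as the nine-line file above, with `num_points` playing the role of the `-d` argument.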
+
+In general, larger values indicate a stronger "connectedness", whereas smaller 
values indicate a weaker connectedness. This will vary somewhat depending on 
the distance function used, though a common one is the [RBF 
kernel](http://en.wikipedia.org/wiki/RBF_kernel) (used in the above example) 
which returns values in the range [0, 1], where 0 indicates completely 
disconnected (or completely dissimilar) and 1 is fully connected (or identical).
+
+The call signature with this matrix could be as follows:
+
+    bin/mahout spectralkmeans \
+        -i s3://mahout-example/input/ \
+        -o s3://mahout-example/output/ \
+        -d 3 \
+        -k 2 \
+        -x 10
+
+There are many other optional arguments, in particular for tweaking the SSVD 
process (block size, number of power iterations, etc) and the k-means 
clustering step (distance measure, convergence delta, etc).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/streaming-k-means.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/streaming-k-means.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/streaming-k-means.md
new file mode 100644
index 0000000..81248de
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/streaming-k-means.md
@@ -0,0 +1,174 @@
+---
+layout: default
+title: StreamingKMeans
+theme:
+   name: retro-mahout
+---
+
+# *StreamingKMeans* algorithm 
+
+The *StreamingKMeans* algorithm is a variant of Algorithm 1 from [Shindler et 
al][1] and consists of two steps:
+
+ 1. Streaming step 
+ 2. BallKMeans step. 
+
+The streaming step is a randomized algorithm that makes one pass through the 
data and 
+produces as many centroids as it determines is optimal. This step can be 
viewed as 
+a preparatory dimensionality reduction. If the size of the data stream is *n* 
and the 
+expected number of clusters is *k*, the streaming step will produce roughly 
*k\*log(n)* 
+clusters that will be passed on to the BallKMeans step which will further 
reduce the 
+number of clusters down to *k*. BallKMeans is a randomized Lloyd-type 
algorithm that
+has been studied in detail, see [Ostrovsky et al][2].
+
+## Streaming step
+
+---
+
+### Overview
+
+The streaming step is a derivative of the streaming 
+portion of Algorithm 1 in [Shindler et al][1]. The main difference between the 
two is that 
+Algorithm 1 of [Shindler et al][1] assumes 
+the knowledge of the size of the data stream and uses it to set a key 
parameter 
+for the algorithm. More precisely, the initial *distanceCutoff* (defined 
below), which is 
+denoted by *f* in [Shindler et al][1], is set to *1/(k(1+log(n)))*. The 
*distanceCutoff* influences the number of clusters that the algorithm 
+will produce. 
+In contrast, the Mahout implementation does not require knowledge of the size 
of the 
+data stream. Instead, it dynamically re-evaluates the parameters that depend 
on the size 
+of the data stream at runtime as more and more data is processed. In 
particular, 
+the parameter *numClusters* (defined below) changes its value as the data is 
processed.   
+
+### Parameters
+
+ - **numClusters** (int): Conceptually, *numClusters* represents the 
algorithm's guess at the optimal 
+number of clusters it is shooting for. In particular, *numClusters* will 
increase at run 
+time as more and more data is processed. Note that *numClusters* is not 
the number of clusters that the algorithm will produce. Also, *numClusters* 
should not be set to the final number of clusters that we expect to receive as 
the output of *StreamingKMeans*. 
+ - **distanceCutoff** (double): a parameter representing the value of the 
distance between a point and 
+its closest centroid after which
+the new point will definitely be assigned to a new cluster. *distanceCutoff* 
can be thought 
+of as an estimate of the variable *f* from Shindler et al. The default initial 
value for 
+*distanceCutoff* is *1.0/numClusters* and *distanceCutoff* grows as a 
geometric progression with 
+common ratio *beta* (see below).    
+ - **beta** (double): a constant parameter that controls the growth of 
*distanceCutoff*. If the initial setting of *distanceCutoff* is *d0*, 
*distanceCutoff* will grow as the geometric progression with initial term *d0* 
and common ratio *beta*. The default value for *beta* is 1.3. 
+ - **clusterLogFactor** (double): a constant parameter such that 
*clusterLogFactor \* log(numProcessedPoints)* is the runtime estimate of the 
number of clusters to be produced by the streaming step. If the final number of 
clusters (that we expect *StreamingKMeans* to output) is *k*, 
*clusterLogFactor* can be set to *k*.  
+ - **clusterOvershoot** (double): a constant multiplicative slack factor that 
slows down the collapsing of clusters. The default value is 2. 
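
The arithmetic behind *distanceCutoff*, *beta* and *clusterLogFactor* can be checked with a small sketch. The helper names are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math

def distance_cutoff_after(num_clusters, beta=1.3, num_increases=0):
    """distanceCutoff after it has been increased num_increases times.

    Starts at the default 1.0 / numClusters and grows geometrically by beta.
    """
    initial = 1.0 / num_clusters
    return initial * beta ** num_increases

def estimated_num_clusters(cluster_log_factor, num_processed_points):
    """Runtime estimate clusterLogFactor * log(numProcessedPoints).

    Natural log is assumed here.
    """
    return cluster_log_factor * math.log(num_processed_points)

cutoff = distance_cutoff_after(num_clusters=10, num_increases=3)
estimate = estimated_num_clusters(cluster_log_factor=20, num_processed_points=1_000_000)
```

With *numClusters* = 10, three increases move the cutoff from 0.1 to about 0.22, and with *clusterLogFactor* = 20 the estimate after one million points is about 276 clusters.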
+
+
+### Algorithm
+
+The algorithm processes the data one-by-one and makes only one pass through 
the data.
+The first point from the data stream will form the centroid of the first 
cluster (this designation may change as more points are processed). Suppose 
there are *r* clusters at one point and a new point *p* is being processed. The 
new point can either be added to one of the existing *r* clusters or become a 
new cluster. To decide:
+
+ - let *c* be the closest cluster to point *p*
+ - let *d* be the distance between *c* and *p*
+ - if *d > distanceCutoff*, create a new cluster from *p* (*p* is too far away 
from the clusters to be part of any one of them)
+ - else (*d <= distanceCutoff*), create a new cluster with probability *d / 
distanceCutoff* (the probability of creating a new cluster increases as *d* 
increases). 
+
+There will be either *r* or *r+1* clusters after processing a new point.
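
The decision rule above can be sketched as follows. This is a simplified illustration, not Mahout's code: it assumes Euclidean distance, ignores point weights, and does not move the centroid of the cluster that absorbs a point. The function name `process_point` is hypothetical.

```python
import math
import random

def process_point(point, centroids, distance_cutoff, rng=random):
    """Assign `point` to a cluster and return that cluster's index.

    A new centroid is appended to `centroids` when the point starts a cluster.
    """
    if not centroids:
        centroids.append(point)  # the first point forms the first cluster
        return 0
    distances = [math.dist(point, c) for c in centroids]
    closest = min(range(len(centroids)), key=distances.__getitem__)
    d = distances[closest]
    # Always split when too far away; otherwise split with probability d / cutoff.
    if d > distance_cutoff or rng.random() < d / distance_cutoff:
        centroids.append(point)
        return len(centroids) - 1
    return closest

centroids = [(0.0, 0.0)]
far = process_point((5.0, 0.0), centroids, distance_cutoff=1.0)
same = process_point((0.0, 0.0), centroids, distance_cutoff=1.0)
```

A point farther than the cutoff always opens a new cluster, while a point that coincides with an existing centroid has probability 0 of doing so.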
+
+As the number of clusters increases, it will go over the  *clusterOvershoot \* 
numClusters* limit (*numClusters* represents a recommendation for the number of 
clusters that the streaming step should aim for and *clusterOvershoot* is the 
slack). To decrease the number of clusters the existing clusters
+are treated as data points and are re-clustered (collapsed). This tends to 
make the number of clusters go down. If the number of clusters is still too 
high, *distanceCutoff* is increased.
+
+## BallKMeans step
+---
+### Overview
+The algorithm is a Lloyd-type algorithm that takes a set of weighted vectors 
and returns k centroids, see [Ostrovsky et al][2] for details. The algorithm 
has two stages: 
+ 
+ 1. Seeding 
+ 2. Ball k-means 
+
+The seeding stage is an initial guess of where the centroids should be. The 
initial guess is improved using the ball k-means stage. 
+
+### Parameters
+
+* **numClusters** (int): the number k of centroids to return.  The algorithm 
will return exactly this number of centroids.
+
+* **maxNumIterations** (int): After seeding, the iterative clustering 
procedure will be run at most *maxNumIterations* times.  1 or 2 iterations are 
recommended.  Increasing beyond this will increase the accuracy of the result 
at the expense of runtime.  Each successive iteration yields diminishing 
returns in lowering the cost.
+
+* **trimFraction** (double): Outliers are ignored when computing the center of 
mass for a cluster.  For any datapoint *x*, let *c* be the nearest centroid.  
Let *d* be the minimum distance from *c* to another centroid.  If the distance 
from *x* to *c* is greater than *trimFraction \* d*, then *x* is considered an 
outlier during that iteration of ball k-means.  The default is 9/10.  In 
[Ostrovsky et al][2], the authors use *trimFraction* = 1/3, but this does not 
mean that 1/3 is optimal in practice.
+
+* **kMeansPlusPlusInit** (boolean): If true, the seeding method is k-means++.  
If false, the seeding method is to select points uniformly at random.  The 
default is true.
+
+* **correctWeights** (boolean): If *correctWeights* is true, outliers will be 
considered when calculating the weight of centroids.  The default is true. Note 
that outliers are not considered when calculating the position of centroids.
+
+* **testProbability** (double): If *testProbability* is *p* (0 < *p* < 1), the 
data (of size n) is partitioned into a test set (of size *p\*n*) and a training 
set (of size *(1-p)\*n*).  If 0, no test set is created (the entire data set is 
used for both training and testing).  The default is 0.1 if *numRuns* > 1.  If 
*numRuns* = 1, then no test set should be created (since it is only used to 
compare the cost between different runs).
+
+* **numRuns** (int): This is the number of runs to perform. The solution of 
lowest cost is returned.  The default is 1 run.
+
+### Algorithm
+The algorithm can be instructed to take multiple independent runs (using the 
*numRuns* parameter) and the algorithm will select the best solution (i.e., the 
one with the lowest cost). In practice, one run is sufficient to find a good 
solution.  
+
+Each run operates as follows: a seeding procedure is used to select k 
centroids, and then ball k-means is run iteratively to refine the solution.
+
+The seeding procedure can be set to either 'uniformly at random' or 
'k-means++' using *kMeansPlusPlusInit* boolean variable. Seeding with k-means++ 
involves more computation but offers better results in practice. 
+ 
+Each iteration of ball k-means runs as follows:
+
+1. Clusters are formed by assigning each datapoint to the nearest centroid
+2. The centers of mass of the trimmed clusters (see *trimFraction* parameter 
above) become the new centroids 
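
One such iteration, including the trimming controlled by *trimFraction*, might look like the sketch below. It is an illustration only: it assumes Euclidean distance, unweighted points and at least two centroids, and `ball_kmeans_iteration` is a hypothetical name.

```python
import math

def ball_kmeans_iteration(points, centroids, trim_fraction=0.9):
    """Run one assignment-and-update step, ignoring outliers in the update."""
    k = len(centroids)
    members = [[] for _ in range(k)]
    for x in points:
        nearest = min(range(k), key=lambda c: math.dist(x, centroids[c]))
        # d: distance from the nearest centroid to its closest other centroid.
        d = min(math.dist(centroids[nearest], centroids[c])
                for c in range(k) if c != nearest)
        # Points outside the ball of radius trim_fraction * d are trimmed.
        if math.dist(x, centroids[nearest]) <= trim_fraction * d:
            members[nearest].append(x)
    new_centroids = []
    for c in range(k):
        if members[c]:
            size = len(members[c])
            new_centroids.append(tuple(sum(dim) / size for dim in zip(*members[c])))
        else:
            new_centroids.append(centroids[c])  # keep an empty cluster in place
    return new_centroids

points = [(0.0,), (0.1,), (0.2,), (4.0,), (10.0,), (10.1,)]
updated = ball_kmeans_iteration(points, [(0.0,), (10.0,)], trim_fraction=0.3)
```

With *trimFraction* = 0.3 the point at 4.0 lies outside the ball of radius 3 around the first centroid, so it is ignored and the centroids move to roughly 0.1 and 10.05. With the default of 0.9 the same point would be included and pull the first centroid to about 1.08.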
+
+The data may be partitioned into a test set and a training set (see 
*testProbability*). The seeding procedure and ball k-means run on the training 
set. The cost is computed on the test set.
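+
+One iteration, as described in the two steps above, might look like the sketch below. It is illustrative Python rather than Mahout code: the function name is hypothetical, plain Euclidean distance is assumed, and the trimming rule follows the *trimFraction* description in the option list (keep points within *trimFraction* times the distance to the closest other centroid). Keeping the old centroid when a trimmed cluster is empty is also an assumption.

```python
import math


def ball_kmeans_iteration(points, centroids, trim_fraction=0.9):
    """One ball k-means step over tuples of floats (hypothetical helper)."""
    k = len(centroids)
    # Step 1: assign every point to its nearest centroid.
    clusters = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)

    # Step 2: the centre of mass of each trimmed cluster becomes the new centroid.
    new_centroids = []
    for i, members in enumerate(clusters):
        others = [math.dist(centroids[i], c) for j, c in enumerate(centroids) if j != i]
        # Radius of the "ball" around centroid i.
        radius = trim_fraction * min(others) if others else math.inf
        kept = [p for p in members if math.dist(p, centroids[i]) <= radius]
        if not kept:
            # Assumption: leave the centroid unchanged if nothing survives trimming.
            new_centroids.append(centroids[i])
            continue
        dims = len(centroids[i])
        new_centroids.append(tuple(sum(p[d] for p in kept) / len(kept) for d in range(dims)))
    return new_centroids
```

+Calling this function repeatedly, up to a maximum number of iterations, gives the refinement loop; with a small *trimFraction*, points far from every centroid stop influencing the centroid positions.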
+
+
+##Usage of *StreamingKMeans*
+  
+     bin/mahout streamingkmeans  
+       -i <input>  
+       -o <output> 
+       -ow  
+       -k <k>  
+       -km <estimatedNumMapClusters>  
+       -e <estimatedDistanceCutoff>  
+       -mi <maxNumIterations>  
+       -tf <trimFraction>  
+       -ri                  
+       -iw  
+       -testp <testProbability>  
+       -nbkm <numBallKMeansRuns>  
+       -dm <distanceMeasure>   
+       -sc <searcherClass>  
+       -np <numProjections>  
+       -s <searchSize>   
+       -rskm  
+       -xm <method>  
+       -h   
+       --tempDir <tempDir>   
+       --startPhase <startPhase>   
+       --endPhase <endPhase>                    
+
+
+###Details on Job-Specific Options:
+                                                           
+ * `--input (-i) <input>`: Path to job input directory.         
+ * `--output (-o) <output>`: The directory pathname for output.            
+ * `--overwrite (-ow)`: If present, overwrite the output directory before 
running job.
+ * `--numClusters (-k) <k>`: The k in k-Means. Approximately this many 
clusters will be generated.      
+ * `--estimatedNumMapClusters (-km) <estimatedNumMapClusters>`: The estimated 
number of clusters to use for the Map phase of the job when running 
StreamingKMeans. This should be around k \* log(n), where k is the final number 
of clusters and n is the total number of data points to cluster.           
+ * `--estimatedDistanceCutoff (-e) <estimatedDistanceCutoff>`: The initial estimated distance cutoff between two points for forming new clusters. If no value is given, it is estimated from the data set.
+ * `--maxNumIterations (-mi) <maxNumIterations>`: The maximum number of 
iterations to run for the BallKMeans algorithm used by the reducer. If no value 
is given, defaults to 10.    
+ * `--trimFraction (-tf) <trimFraction>`: The 'ball' aspect of ball k-means 
means that only the closest points to the centroid will actually be used for 
updating. The fraction of the points to be used is those points whose distance 
to the center is within trimFraction \* distance to the closest other center. 
If no value is given, defaults to 0.9.   
+ * `--randomInit (-ri)`: Whether to use k-means++ initialization or random initialization of the seed centroids. k-means++ generally produces better clusters but takes longer; random initialization is faster but produces worse clusters, fails more often, and needs multiple runs to match k-means++. If set, random initialization is used.
+ * `--ignoreWeights (-iw)`: Whether to correct the weights of the centroids 
after the clustering is done. The weights end up being wrong because of the 
trimFraction and possible train/test splits. In some cases, especially in a 
pipeline, having an accurate count of the weights is useful. If set, ignores 
the final weights. 
+ * `--testProbability (-testp) <testProbability>`: A double value between 0 and 1 that represents the fraction of points to be used for 'testing' different clustering runs in the final BallKMeans step. If no value is given, defaults to 0.1.
+ * `--numBallKMeansRuns (-nbkm) <numBallKMeansRuns>`: Number of BallKMeans runs to use at the end to try to cluster the points. If no value is given, defaults to 4.
+ * `--distanceMeasure (-dm) <distanceMeasure>`: The classname of the DistanceMeasure. Default is SquaredEuclidean.
+ * `--searcherClass (-sc) <searcherClass>`: The type of searcher to be used when performing nearest neighbor searches. Defaults to ProjectionSearch.
+ * `--numProjections (-np) <numProjections>`: The number of projections considered in estimating the distances between vectors. Only used when the searcher requested is either ProjectionSearch or FastProjectionSearch. If no value is given, defaults to 3.
+ * `--searchSize (-s) <searchSize>`: In more efficient searches (non-BruteSearch), not all distances are calculated for determining the nearest neighbors. The number of elements whose distances from the query vector are actually computed is proportional to searchSize. If no value is given, defaults to 1.
+ * `--reduceStreamingKMeans (-rskm)`: There might be too many intermediate clusters from the mapper to fit into memory, so the reducer can run another pass of StreamingKMeans to collapse them down to fewer clusters.
+ * `--method (-xm) <method>`: The execution method to use: sequential or mapreduce. Default is mapreduce.
+ * `--help (-h)`: Print out help.
+ * `--tempDir <tempDir>`: Intermediate output directory.
+ * `--startPhase <startPhase>`: First phase to run.
+ * `--endPhase <endPhase>`: Last phase to run.
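+
+As a rough worked example of the k \* log(n) guideline given for `--estimatedNumMapClusters`, the snippet below computes a candidate value. The helper name is hypothetical, and both the natural logarithm and the rounding up are assumptions, since the text only says "log(n)".

```python
import math


def estimated_num_map_clusters(k, n):
    """Rule-of-thumb value for -km: about k * log(n).

    Assumptions: natural logarithm, rounded up to a whole number.
    """
    return math.ceil(k * math.log(n))


# Example: 20 final clusters over one million points.
print(estimated_num_map_clusters(20, 1_000_000))
```

+The value grows only logarithmically with the data size, so even a large increase in n changes the estimate modestly.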
+
+
+##References
+
+1. [M. Shindler, A. Wong, A. Meyerson: Fast and Accurate k-means For Large 
Datasets][1]
+2. [R. Ostrovsky, Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of 
Lloyd-Type Methods for the k-means Problem][2]
+
+
+[1]: http://nips.cc/Conferences/2011/Program/event.php?ID=2989 "M. Shindler, 
A. Wong, A. Meyerson: Fast and Accurate k-means For Large Datasets"
+
+[2]: http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf "R. Ostrovsky, 
Y. Rabani, L. Schulman, Ch. Swamy: The Effectiveness of Lloyd-Type Methods for 
the k-means Problem"

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/viewing-result.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/viewing-result.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/viewing-result.md
new file mode 100644
index 0000000..4222732
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/viewing-result.md
@@ -0,0 +1,15 @@
+---
+layout: default
+title: Viewing Result
+theme:
+   name: retro-mahout
+---
+* [Algorithm Viewing pages](#ViewingResult-AlgorithmViewingpages)
+
+There are various technologies available to view the output of Mahout
+algorithms.
+* Clusters
+
+<a name="ViewingResult-AlgorithmViewingpages"></a>
+# Algorithm Viewing pages
+{pagetree:root=@self|excerpt=true|expandCollapseAll=true}

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/viewing-results.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/viewing-results.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/viewing-results.md
new file mode 100644
index 0000000..aacdd67
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/viewing-results.md
@@ -0,0 +1,49 @@
+---
+layout: default
+title: Viewing Results
+theme:
+   name: retro-mahout
+---
+<a name="ViewingResults-Intro"></a>
+# Intro
+
+Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
+sequence files or other data structures.  This page is intended to
+demonstrate the various ways one might inspect the outcome of these jobs.
+The page is organized by algorithm.
+
+<a name="ViewingResults-GeneralUtilities"></a>
+# General Utilities
+
+<a name="ViewingResults-SequenceFileDumper"></a>
+## Sequence File Dumper
+
+
+<a name="ViewingResults-Clustering"></a>
+# Clustering
+
+<a name="ViewingResults-ClusterDumper"></a>
+## Cluster Dumper
+
+Run the following to print out all options:
+
+    java  -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --help
+
+
+
+<a name="ViewingResults-Example"></a>
+### Example
+
+    java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper \
+          --seqFileDir ./solr-clust-n2/out/clusters-2 \
+          --dictionary ./solr-clust-n2/dictionary.txt \
+          --substring 100 --pointsDir ./solr-clust-n2/out/points/
+    
+
+
+
+<a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a>
+## Cluster Labels (MAHOUT-163)
+
+<a name="ViewingResults-Classification"></a>
+# Classification

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/visualizing-sample-clusters.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/visualizing-sample-clusters.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/visualizing-sample-clusters.md
new file mode 100644
index 0000000..52d07e7
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/visualizing-sample-clusters.md
@@ -0,0 +1,50 @@
+---
+layout: default
+title: Visualizing Sample Clusters
+theme:
+   name: retro-mahout
+---
+
+<a name="VisualizingSampleClusters-Introduction"></a>
+# Introduction
+
+Mahout provides examples to visualize sample clusters that are created by
+our clustering algorithms. Note that the visualization is done by Swing programs, so you have to run them from a window system on the same
+machine, or be logged in via a remote desktop.
+
+To visualize the clusters, you have to execute the Java
+classes in the *org.apache.mahout.clustering.display* package of the
+mahout-examples module. The easiest way to achieve this is to [set up Mahout](users/basics/quickstart.html) in your IDE.
+
+<a name="VisualizingSampleClusters-Visualizingclusters"></a>
+# Visualizing clusters
+
+The following classes in *org.apache.mahout.clustering.display* can be run
+without parameters to generate a sample data set and run the reference
+clustering implementations over them:
+
+1. **DisplayClustering** - generates 1000 samples from three, symmetric
+distributions. This is the same data set that is used by the following
+clustering programs. It displays the points on a screen and superimposes
+the model parameters that were used to generate the points. You can edit
+the *generateSamples()* method to change the sample points used by these
+programs.
+1. **DisplayClustering** - displays initial areas of generated points
+1. **DisplayCanopy** - uses Canopy clustering
+1. **DisplayKMeans** - uses k-Means clustering
+1. **DisplayFuzzyKMeans** - uses Fuzzy k-Means clustering
+1. **DisplaySpectralKMeans** - uses Spectral k-Means clustering via a map-reduce algorithm
+
+If you are using Eclipse, just right-click on each of the classes mentioned above and choose "Run As > Java Application". To run these directly from the command line:
+
+    cd $MAHOUT_HOME/examples
+    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering
+
+You can substitute other names above for *DisplayClustering*. 
+
+
+Note that some of these programs display the sample points and then superimpose all of the clusters from each iteration. The last iteration's clusters are in
+bold red and the previous several are colored (orange, yellow, green, blue,
+magenta) in order, after which all earlier clusters are in light grey. This
+helps to visualize how the clusters converge upon a solution over multiple
+iterations.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/misc/mr---map-reduce.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/misc/mr---map-reduce.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/mr---map-reduce.md
new file mode 100644
index 0000000..b03d6ad
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/mr---map-reduce.md
@@ -0,0 +1,19 @@
+---
+layout: default
+title: MR - Map Reduce
+theme:
+   name: retro-mahout
+---
+
+MapReduce is a framework for processing huge datasets on certain
+kinds of distributable problems using a large number of computers (nodes),
+collectively referred to as a cluster. Computational processing
+can occur on data stored either in a filesystem (unstructured) or within a
+database (structured).
+
+Also written M/R.
+
+
+See also:
+
+* [http://wiki.apache.org/hadoop/HadoopMapReduce](http://wiki.apache.org/hadoop/HadoopMapReduce)
+* [http://en.wikipedia.org/wiki/MapReduce](http://en.wikipedia.org/wiki/MapReduce)

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/misc/parallel-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/misc/parallel-frequent-pattern-mining.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/parallel-frequent-pattern-mining.md
new file mode 100644
index 0000000..e2978a4
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/parallel-frequent-pattern-mining.md
@@ -0,0 +1,185 @@
+---
+layout: default
+title: Parallel Frequent Pattern Mining
+theme:
+    name: retro-mahout
+---
+Mahout has a Top K Parallel FPGrowth implementation. It is based on the paper [http://infolab.stanford.edu/~echang/recsys08-69.pdf](http://infolab.stanford.edu/~echang/recsys08-69.pdf)
+ with some optimisations in mining the data.
+
+Given a huge transaction list, the algorithm finds all unique features (sets
+of field values) and eliminates those features whose frequency in the whole
+dataset is less than minSupport. Using the remaining N features, we find
+the top K closed patterns for each of them, generating a total of NxK
+patterns. The FPGrowth algorithm is a generic implementation, so any
+Object type can be used to denote a feature. The current implementation requires you to use
+a String as the object type. You may implement a version for any object by
+creating Iterators, Convertors and TopKPatternWritable for that particular
+object. For more information, please refer to the package
+org.apache.mahout.fpm.pfpgrowth.convertors.string
+
+For example:
+
+    FPGrowth<String> fp = new FPGrowth<String>();
+    Set<String> features = new HashSet<String>();
+    fp.generateTopKStringFrequentPatterns(
+        // transactions to mine
+        new StringRecordIterator(
+            new FileLineIterable(new File(input), encoding, false), pattern),
+        // frequent items and their counts, from a second pass over the input
+        fp.generateFList(
+            new StringRecordIterator(
+                new FileLineIterable(new File(input), encoding, false), pattern),
+            minSupport),
+        minSupport,
+        maxHeapSize,
+        features,
+        new StringOutputConvertor(
+            new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer)));
+
+* The first argument is the iterator over the transactions, in this case an
+Iterator<List<String>>
+* The second argument is the output of the generateFList function, which
+returns the frequent items and their frequencies from the given database
+transaction iterator
+* The third argument is the minimum support of the patterns to be generated
+* The fourth argument is the maximum number of patterns to be mined for
+each feature
+* The fifth argument is the set of features for which the frequent patterns
+have to be mined
+* The last argument is an output collector which takes \[key, value\] pairs
+of Feature and TopK Patterns in the format \[String,
+List<Pair<List<String>, Long>>\] and writes them to the appropriate writer
+class, which takes care of storing the object, in this case in a Sequence
+File Output format
+
+<a 
name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a>
+## Running Frequent Pattern Growth via command line
+
+The command line launcher for string transaction data,
+org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features, including
+specifying the regex pattern for splitting a string line of a transaction
+into the constituent features.
+
+Input files have to be in the following format:
+
+    <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE....
+
+Instead of a tab you could use `,` or `|`, as the default tokenization is done using the Java regex pattern `[ ,\t]*[,|\t][ ,\t]*`.
+You can override this parameter to parse your log files or transaction
+files (each line is a transaction). The FPGrowth algorithm mines the top K
+frequently occurring sets of items and their counts from the given input
+data.
+
+$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this
+format.
+Another sample file is accidents.dat.gz from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/).
+As a quick test, try this:
+
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+The minimumSupport parameter \-s is the minimum number of times a pattern
+or a feature needs to occur in the dataset so that it is included in the
+patterns generated. You can speed up the process by using a large value of
+s. There are cases where you will have fewer than k patterns for a
+particular feature because the rest do not meet the minimum support
+criterion.
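+
+The effect of the minimum support threshold on individual features can be illustrated with a small sketch. This is plain Python for explanation only, not the Mahout code path; the function name `generate_f_list` merely echoes the Java method mentioned earlier, and counting each item at most once per transaction is an assumption.

```python
from collections import Counter


def generate_f_list(transactions, min_support):
    """Return (item, count) pairs for items meeting min_support.

    Assumption: an item is counted once per transaction it appears in.
    Results are sorted by descending count, then by item name.
    """
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c >= min_support]
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["bread", "milk"],
    ["tea"],
]
print(generate_f_list(transactions, min_support=2))
```

+Raising `min_support` shrinks the list of surviving features, which is why a larger value of s speeds up the later mining stages.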
+
+Note that the input to the algorithm can be an uncompressed file, a
+gz-compressed file, or even a directory containing any number of such files.
+We modified the regex to use a space to split the tokens. Note that the input
+regex string is escaped.
+
+<a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a>
+## Running Parallel FPGrowth
+
+Running parallel FPGrowth is as easy as changing the flag to \-method
+mapreduce and adding the number of groups parameter, e.g. \-g 20 for 20
+groups. First, let's run the above sample test in map-reduce mode:
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in
+sequential mode (with 5 gigs of RAM allocated). In a separate test,
+the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30
+seconds in sequential mode.
+
+Here is another dataset which, while several times larger, requires much
+less time to find frequent patterns, as there are very few. Get
+accidents.dat.gz from 
[http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/)
+ and place it on your hdfs in a folder named accidents. Then, run the
+hadoop version of the FPGrowth job:
+
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+
+OR to run a dataset of this size in sequential mode on a single machine
+let's give Mahout a lot more memory and only keep features with more than
+300 members:
+
+    export MAHOUT_HEAPSIZE=5000
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+
+The numGroups parameter \-g in FPGrowthJob specifies the number of groups
+into which transactions have to be decomposed. The default of 1000 works
+very well on a single-machine cluster; this may be very different on large
+clusters.
+
+Note that accidents.dat has 340 unique features. So we chose \-g 10 to
+split the transactions across 10 shards where 34 patterns are mined from
+each shard. (Note: the number of features doesn't need to be exactly
+divisible by g; the algorithm takes care of calculating the split.) For better performance in large
+datasets and clusters, try not to mine for more than 20-25 features per
+shard. Stick to the defaults on a small machine.
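+
+The arithmetic behind the choice of \-g can be checked with a few lines of Python. The helper names are hypothetical, and rounding up when the division is not exact is an assumption about how a split would be sized, not a statement about Mahout's internals.

```python
def features_per_group(num_features, num_groups):
    """Features handled per shard, rounding up when not exactly divisible."""
    return -(-num_features // num_groups)  # integer ceiling division


def groups_for_target(num_features, max_features_per_group):
    """Smallest number of groups keeping each shard at or below the target."""
    return -(-num_features // max_features_per_group)


print(features_per_group(340, 10))   # the accidents.dat example with -g 10
print(groups_for_target(340, 25))    # groups needed for at most 25 features each
```

+For the 340 features of accidents.dat, staying within the suggested 20-25 features per shard would mean roughly 14 to 17 groups rather than 10.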
+
+The numTreeCacheEntries parameter \-tc specifies the number of generated
+conditional FP-Trees to be kept in memory so that subsequent operations do
+not need to regenerate them. Increasing this number increases the memory
+consumption but might improve speed up to a certain point. This depends
+entirely on the dataset in question. A value of 5-10 is recommended for
+mining up to the top 100 patterns for each feature.
+
+<a name="ParallelFrequentPatternMining-Viewingtheresults"></a>
+## Viewing the results
+The output will be dumped to a SequenceFile in the frequentpatterns
+directory in Text=>TopKStringPatterns format. Run this command to see a few
+of the Frequent Patterns:
+
+    bin/mahout seqdumper \
+         -i patterns/frequentpatterns/part-?-00000 \
+         -n 4
+
+or replace -n 4 with -c for the count of patterns.
+ 
+Open questions: how does one experiment and monitor with these various
+parameters?

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/misc/perceptron-and-winnow.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/misc/perceptron-and-winnow.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/perceptron-and-winnow.md
new file mode 100644
index 0000000..308040c
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/perceptron-and-winnow.md
@@ -0,0 +1,41 @@
+---
+layout: default
+title: Perceptron and Winnow
+theme:
+    name: retro-mahout
+---
+<a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a>
+# Classification with Perceptron or Winnow
+
+Both algorithms are comparatively simple linear classifiers. Given training
+data in some n-dimensional vector space that is annotated with binary
+labels, the algorithms are guaranteed to find a linear separating hyperplane
+if one exists. In contrast to the Perceptron, Winnow works only for binary
+feature vectors.
+
+For more information on the Perceptron see for instance:
+http://en.wikipedia.org/wiki/Perceptron
+
+Concise course notes on both algorithms:
+http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture24.pdf
+
+Although the algorithms are comparatively simple, they still work pretty well
+for text classification and are fast to train even for huge example sets.
+In contrast to Naive Bayes, they are not based on the assumption that all
+features (in the domain of text classification: all terms in a document)
+are independent.
+
+<a name="PerceptronandWinnow-Strategyforparallelisation"></a>
+## Strategy for parallelisation
+
+Currently the strategy for parallelisation is simple: Given there is enough
+training data, split the training data. Train the classifier on each split.
+The resulting hyperplanes are then averaged.
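+
+The split, train, and average strategy can be sketched as below. This is an illustrative Python toy, not the patch's code: all function names are hypothetical, labels are assumed to be +1 or -1, and the training loop is the textbook perceptron update. Averaging is a heuristic, so the combined hyperplane is not guaranteed to separate every data set.

```python
def train_perceptron(examples, dims, epochs=10):
    """Textbook perceptron on (features, label) pairs with labels +1/-1."""
    w, b = [0.0] * dims, 0.0
    for _ in range(epochs):
        for x, label in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * activation <= 0:  # misclassified: move the hyperplane
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b


def average_hyperplanes(models):
    """Average the weight vectors and biases of independently trained models."""
    dims = len(models[0][0])
    w = [sum(m[0][d] for m in models) / len(models) for d in range(dims)]
    b = sum(m[1] for m in models) / len(models)
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Two splits of a small, linearly separable training set.
split_a = [((2, 2), 1), ((-2, -2), -1), ((3, 3), 1), ((-3, -3), -1)]
split_b = [((2, 3), 1), ((-2, -3), -1), ((3, 2), 1), ((-3, -2), -1)]
combined = average_hyperplanes([train_perceptron(s, 2) for s in (split_a, split_b)])
```

+Each split can be trained independently, which is what makes the approach easy to distribute; only the small weight vectors need to be brought together at the end.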
+
+<a name="PerceptronandWinnow-Roadmap"></a>
+## Roadmap
+
+Currently the patch only contains the code for the classifier itself. It is
+planned to provide unit tests and at least one example based on the WebKB
+dataset by the end of November for the serial version. After that the
+parallelisation will be added.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/misc/testing.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/misc/testing.md 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/testing.md
new file mode 100644
index 0000000..dc3fd43
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/testing.md
@@ -0,0 +1,46 @@
+---
+layout: default
+title: Testing
+theme:
+    name: retro-mahout
+---
+<a name="Testing-Intro"></a>
+# Intro
+
+As Mahout matures, solid testing procedures are needed.  This page and its
+children capture test plans along with ideas for improving our testing.
+
+<a name="Testing-TestPlans"></a>
+# Test Plans
+
+* [0.6](0.6.html)
+ - Test Plans for the 0.6 release
+There are no special plans except for unit tests, and user testing of the
+Hadoop jobs.
+
+<a name="Testing-TestIdeas"></a>
+# Test Ideas
+
+<a name="Testing-Regressions/Benchmarks/Integrations"></a>
+## Regressions/Benchmarks/Integrations
+* Algorithmic quality and speed are not tested, except in a few instances.
+Such tests often require much longer run times (minutes to hours), a
+running Hadoop cluster, and downloads of large datasets (in the megabytes). 
+* Standardized speed tests are difficult on different hardware. 
+* Unit tests of external integrations require access to externals: HDFS,
+S3, JDBC, Cassandra, etc. 
+
+Apache Jenkins is not able to support these environments. Commercial
+donations would help. 
+
+<a name="Testing-UnitTests"></a>
+## Unit Tests
+Mahout's current tests are almost entirely unit tests. Algorithm tests
+generally supply a few numbers to code paths and verify that the expected
+numbers come out. 'mvn test' runs these tests. There is "positive" coverage
+of a great many utilities and algorithms. A much smaller percentage includes
+"negative" coverage (bogus setups, inputs, combinations).
+
+<a name="Testing-Other"></a>
+## Other
+
