http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/classification/partial-implementation.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/classification/partial-implementation.md
 
b/website-old/docs/algorithms/map-reduce/classification/partial-implementation.md
deleted file mode 100644
index 59f701f..0000000
--- 
a/website-old/docs/algorithms/map-reduce/classification/partial-implementation.md
+++ /dev/null
@@ -1,146 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Partial Implementation
-theme:
-    name: retro-mahout
----
-
-
-# Classifying with random forests
-
-<a name="PartialImplementation-Introduction"></a>
-# Introduction
-
-This quick start page shows how to build a decision forest using the
-partial implementation. This tutorial also explains how to use the decision
-forest to classify new data.
-Partial Decision Forests is a MapReduce implementation where each mapper
-builds a subset of the forest using only the data available in its
-partition. This allows building forests from large datasets, as long as
-each partition can be loaded into memory.
-
-<a name="PartialImplementation-Steps"></a>
-# Steps
-<a name="PartialImplementation-Downloadthedata"></a>
-## Download the data
-* The current implementation is compatible with the UCI repository file
-format. In this example we'll use the NSL-KDD dataset because it's large
-enough to show the performance of the partial implementation.
-You can download the dataset here: http://nsl.cs.unb.ca/NSL-KDD/
-You can either download the full training set "KDDTrain+.ARFF" or a 20%
-subset "KDDTrain+_20Percent.ARFF" (we'll use the full dataset in this
-tutorial), plus the test set "KDDTest+.ARFF".
-* Open the train and test files and remove all the lines that begin with
-'@'. All those lines are at the top of the files. Keep a copy of those lines
-somewhere, though, because they'll help us describe the dataset to Mahout.
-* Put the data in HDFS:
-
-    $HADOOP_HOME/bin/hadoop fs -mkdir testdata
-    $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
-
-<a name="PartialImplementation-BuildtheJobfiles"></a>
-## Build the Job files
-* In $MAHOUT_HOME/ run:
-
-    mvn clean install -DskipTests
-
-<a name="PartialImplementation-Generateafiledescriptorforthedataset:"></a>
-## Generate a file descriptor for the dataset: 
-run the following command:
-
-    $HADOOP_HOME/bin/hadoop jar \
-      $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar \
-      org.apache.mahout.classifier.df.tools.Describe \
-      -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info \
-      -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
-
-The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string describes all the attributes
-of the data. In this cases, it means 1 numerical(N) attribute, followed by
-3 Categorical(C) attributes, ...L indicates the label. You can also use 'I'
-to ignore some attributes
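-
-To see what such a descriptor means mechanically, here is a small,
-hypothetical Java sketch (not Mahout's Describe code) that expands a
-descriptor string into one type code per attribute:
-
-    // Expands a descriptor such as "N 3 C 2 N ... L" into a list with one
-    // entry per attribute: "N" (numerical), "C" (categorical), "I" (ignored)
-    // or "L" (label). A token preceded by a count is repeated that many times.
-    import java.util.ArrayList;
-    import java.util.List;
-
-    public class DescriptorSketch {
-      public static List<String> expand(String descriptor) {
-        List<String> types = new ArrayList<>();
-        int repeat = 1;
-        for (String token : descriptor.trim().split("\\s+")) {
-          if (token.matches("\\d+")) {
-            repeat = Integer.parseInt(token);   // a count applies to the next token
-          } else {
-            for (int i = 0; i < repeat; i++) {
-              types.add(token);                 // N, C, I or L
-            }
-            repeat = 1;
-          }
-        }
-        return types;
-      }
-
-      public static void main(String[] args) {
-        System.out.println(expand("N 3 C 2 N C 4 N C 8 N 2 C 19 N L"));
-      }
-    }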
-
-<a name="PartialImplementation-Runtheexample"></a>
-## Run the example
-
-
-    $HADOOP_HOME/bin/hadoop jar \
-      $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar \
-      org.apache.mahout.classifier.df.mapreduce.BuildForest \
-      -Dmapred.max.split.size=1874231 \
-      -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info \
-      -sl 5 -p -t 100 -o nsl-forest
-
-which builds 100 trees (-t argument) using the partial implementation (-p).
-Each tree is built using 5 randomly selected attributes per node (-sl
-argument), and the example writes the decision forest to the "nsl-forest"
-directory (-o).
-The number of partitions is controlled by the -Dmapred.max.split.size
-argument, which indicates to Hadoop the maximum size of each partition, in
-this case 1/10 of the size of the dataset. Thus 10 partitions will be used.
-IMPORTANT: using fewer partitions should give better classification results
-but needs a lot of memory, so if the jobs are failing, try increasing the
-number of partitions.
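-
-The value 1874231 above is simply the dataset size divided by the desired
-number of partitions. A minimal sketch of that arithmetic (the file size
-below is hypothetical; check the actual size of your KDDTrain+.arff):
-
-    // Choose mapred.max.split.size so the input splits into roughly
-    // `partitions` pieces: maxSplitSize ~= datasetSizeBytes / partitions.
-    public class SplitSizeSketch {
-      public static void main(String[] args) {
-        long datasetSizeBytes = 18_742_306L;   // hypothetical size of KDDTrain+.arff
-        int partitions = 10;                   // desired number of mappers
-        long maxSplitSize = (long) Math.ceil((double) datasetSizeBytes / partitions);
-        System.out.println("-Dmapred.max.split.size=" + maxSplitSize);
-      }
-    }
-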
-* The example outputs the Build Time and the oob error estimation
-
-
-    10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582
-    10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate :
-0.002325895231517865
-    10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in:
-nsl-forest/forest.seq
-
-
-<a name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></a>
-## Using the Decision Forest to Classify new data
-run the following command:
-
-    $HADOOP_HOME/bin/hadoop jar \
-      $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar \
-      org.apache.mahout.classifier.df.mapreduce.TestForest \
-      -i testdata/KDDTest+.arff -ds testdata/KDDTrain+.info \
-      -m nsl-forest -a -mr -o predictions
-
-This will compute the predictions for the "KDDTest+.arff" dataset (-i
-argument) using the same data descriptor generated for the training dataset
-(-ds) and the decision forest built previously (-m). Optionally (if the test
-dataset contains the labels of the tuples) run the analyzer to compute the
-confusion matrix (-a), and you can also store the predictions in a text
-file or a directory of text files (-o). Passing the (-mr) parameter will use
-Hadoop to distribute the classification.
-
-* The example should output the classification time and the confusion
-matrix
-
-
-    10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s
-355
-    10/03/13 18:08:56 INFO mapreduce.TestForest:
-=======================================================
-    Summary
-    -------------------------------------------------------
-    Correctly Classified Instances             :      17657       78.3224%
-    Incorrectly Classified Instances   :       4887       21.6776%
-    Total Classified Instances         :      22544
-    
-    =======================================================
-    Confusion Matrix
-    -------------------------------------------------------
-    a  b       <--Classified as
-    9459       252      |  9711        a     = normal
-    4635       8198     |  12833       b     = anomaly
-    Default Category: unknown: 2
-
-
-If the input is a single file then the output will be a single text file;
-in the above example 'predictions' would be one single file. If the input
-is a directory containing, for example, two files 'a.data' and 'b.data', then
-the output will be a directory 'predictions' containing two files
-'a.data.out' and 'b.data.out'.
-
-<a name="PartialImplementation-KnownIssuesandlimitations"></a>
-## Known Issues and limitations
-The "Decision Forest" code is still "a work in progress", many features are
-still missing. Here is a list of some known issues:
-* For now, the training does not support multiple input files. The input
-dataset must be one single file (this support will be available with the 
upcoming release). 
-Classifying new data does support multiple
-input files.
-* The tree building is done when each mapper.close() method is called.
-Because the mappers don't refresh their state, the job can fail when the
-dataset is big and you try to build a large number of trees.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/classification/random-forests.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/classification/random-forests.md 
b/website-old/docs/algorithms/map-reduce/classification/random-forests.md
deleted file mode 100644
index 6310909..0000000
--- a/website-old/docs/algorithms/map-reduce/classification/random-forests.md
+++ /dev/null
@@ -1,234 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Random Forests
-theme:
-    name: retro-mahout
----
-
-<a name="RandomForests-HowtogrowaDecisionTree"></a>
-### How to grow a Decision Tree
-
-source: [3]
-
-LearnUnprunedTree(*X*, *Y*)
-
-Input: *X*, a matrix of *R* rows and *M* columns, where X_ij is the value of
-the *j*'th attribute in the *i*'th input datapoint. Each column consists of
-either all real values or all categorical values.
-Input: *Y*, a vector of *R* elements, where Y_i is the output class of the
-*i*'th datapoint. The Y_i values are categorical.
-Output: an unpruned decision tree
-
-
-    If all records in X have identical values in all their attributes
-    (this includes the case where R < 2), return a Leaf Node predicting
-    the majority output, breaking ties randomly
-    If all values in Y are the same, return a Leaf Node predicting this
-    value as the output
-    Else
-        select m variables at random out of the M variables
-        For j = 1 .. m
-            If the j'th attribute is categorical
-                IG_j = IG(Y|X_j)    (see Information Gain)
-            Else (the j'th attribute is real-valued)
-                IG_j = IG*(Y|X_j)   (see Information Gain)
-        Let j* = argmax_j IG_j      (this is the splitting attribute we'll use)
-        If j* is categorical then
-            For each value v of the j*'th attribute
-                Let X^v = subset of rows of X in which X_ij* = v
-                Let Y^v = corresponding subset of Y
-                Let Child^v = LearnUnprunedTree(X^v, Y^v)
-            Return a decision tree node, splitting on the j*'th attribute.
-            The number of children equals the number of values of the j*'th
-            attribute, and the v'th child is Child^v
-        Else (j* is real-valued); let t be the best split threshold
-            Let X^LO = subset of rows of X in which X_ij* <= t
-            Let Y^LO = corresponding subset of Y
-            Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
-            Let X^HI = subset of rows of X in which X_ij* > t
-            Let Y^HI = corresponding subset of Y
-            Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
-            Return a decision tree node, splitting on the j*'th attribute.
-            It has two children corresponding to whether the j*'th attribute
-            is above or below the threshold t
-
-*Note*: there are alternatives to Information Gain for splitting nodes.
-
-<a name="RandomForests-Informationgain"></a>
-### Information gain
-
-source: [3]
-
-#### Nominal attributes
-
-Suppose X can take one of m values V_1, V_2, ..., V_m, with
-P(X=V_1)=p_1, P(X=V_2)=p_2, ..., P(X=V_m)=p_m.
-
-    H(X)     = -sum_{j=1..m} p_j log2 p_j      (the entropy of X)
-    H(Y|X=v) = the entropy of Y among only those records in which X has value v
-    H(Y|X)   = sum_j p_j H(Y|X=v_j)
-    IG(Y|X)  = H(Y) - H(Y|X)
-
-#### Real-valued attributes
-
-Suppose X is real-valued.
-
-    IG(Y|X:t) = H(Y) - H(Y|X:t)
-    H(Y|X:t)  = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
-    IG*(Y|X)  = max_t IG(Y|X:t)
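-
-As an illustration only (this is not Mahout's implementation), a small Java
-sketch that computes H(Y), H(Y|X) and IG(Y|X) for a nominal attribute from
-raw attribute/label arrays; the tiny dataset in main() is made up:
-
-    import java.util.HashMap;
-    import java.util.Map;
-
-    public class InformationGainSketch {
-
-      // Entropy of a discrete variable, given the counts of each value.
-      static double entropy(Map<String, Integer> counts, int total) {
-        double h = 0.0;
-        for (int c : counts.values()) {
-          double p = (double) c / total;
-          h -= p * (Math.log(p) / Math.log(2));
-        }
-        return h;
-      }
-
-      // IG(Y|X) = H(Y) - sum_v P(X=v) * H(Y|X=v), for a categorical attribute x.
-      static double informationGain(String[] x, String[] y) {
-        Map<String, Integer> yCounts = new HashMap<>();
-        Map<String, Integer> xCounts = new HashMap<>();
-        Map<String, Map<String, Integer>> yCountsByX = new HashMap<>();
-        for (int i = 0; i < y.length; i++) {
-          yCounts.merge(y[i], 1, Integer::sum);
-          xCounts.merge(x[i], 1, Integer::sum);
-          yCountsByX.computeIfAbsent(x[i], k -> new HashMap<>())
-                    .merge(y[i], 1, Integer::sum);
-        }
-        double hY = entropy(yCounts, y.length);
-        double hYgivenX = 0.0;
-        for (Map.Entry<String, Map<String, Integer>> e : yCountsByX.entrySet()) {
-          int nv = xCounts.get(e.getKey());
-          hYgivenX += ((double) nv / y.length) * entropy(e.getValue(), nv);
-        }
-        return hY - hYgivenX;
-      }
-
-      public static void main(String[] args) {
-        // Tiny hypothetical dataset: attribute x and class label y.
-        String[] x = {"tcp", "tcp", "udp", "udp", "icmp", "icmp"};
-        String[] y = {"normal", "normal", "anomaly", "normal", "anomaly", "anomaly"};
-        System.out.println("IG(Y|X) = " + informationGain(x, y));
-      }
-    }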
-
-<a name="RandomForests-HowtogrowaRandomForest"></a>
-### How to grow a Random Forest
-
-source: [1]
-
-Each tree is grown as follows:
-1. if the number of cases in the training set is *N*, sample *N* cases at
-random, but with replacement, from the original data. This sample will be
-the training set used to grow the tree.
-1. if there are *M* input variables, a number *m << M* is specified such
-that at each node, *m* variables are selected at random out of the *M* and
-the best split on these *m* is used to split the node. The value of *m* is
-held constant while the forest is grown.
-1. each tree is grown to the largest extent possible. There is no pruning.
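-
-A minimal sketch of the two sampling steps above: a bootstrap sample of the
-N cases and a random choice of m out of M candidate variables at a node. The
-values of N and M are hypothetical; m follows the sqrt(M) rule of thumb
-mentioned in the next section.
-
-    import java.util.ArrayList;
-    import java.util.Collections;
-    import java.util.List;
-    import java.util.Random;
-
-    public class ForestSamplingSketch {
-      public static void main(String[] args) {
-        Random rng = new Random(42);
-        int N = 1000;   // number of training cases (hypothetical)
-        int M = 41;     // number of input variables (hypothetical)
-        int m = (int) Math.round(Math.sqrt(M));   // Breiman's suggestion: m ~ sqrt(M)
-
-        // Step 1: a bootstrap sample of N case indices, drawn with replacement.
-        int[] bootstrap = new int[N];
-        for (int i = 0; i < N; i++) {
-          bootstrap[i] = rng.nextInt(N);
-        }
-
-        // Step 2 (repeated at every node): pick m of the M variables at random,
-        // without replacement; the best split among these m is used.
-        List<Integer> variables = new ArrayList<>();
-        for (int j = 0; j < M; j++) {
-          variables.add(j);
-        }
-        Collections.shuffle(variables, rng);
-        List<Integer> candidates = variables.subList(0, m);
-
-        System.out.println("m = " + m + ", candidate variables at this node: " + candidates);
-      }
-    }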
-
-<a name="RandomForests-RandomForestparameters"></a>
-### Random Forest parameters
-
-source: [2]
-
-Random Forests are easy to use; the only two parameters a user of the
-technique has to determine are the number of trees to be used and the
-number of variables (*m*) to be randomly selected from the available set of
-variables.
-Breiman's recommendation is to pick a large number of trees, and to use the
-square root of the number of variables for *m*.
-
-<a name="RandomForests-Howtopredictthelabelofacase"></a>
-### How to predict the label of a case
-
-Classify(*node*, *V*)
-
-Input: *node*, a node of the decision tree; if *node.attribute = j* then the
-split at this node is done on the *j*'th attribute.
-Input: *V*, a vector of *M* columns where V_j = the value of the *j*'th
-attribute.
-Output: the label of *V*
-
-    If node is a Leaf then
-        Return the value predicted by node
-    Else
-        Let j = node.attribute
-        If j is categorical then
-            Let v = V_j
-            Let child^v = the child node corresponding to the attribute value v
-            Return Classify(child^v, V)
-        Else (j is real-valued)
-            Let t = node.threshold   (the split threshold)
-            If V_j < t then
-                Let child^LO = the child node corresponding to (< t)
-                Return Classify(child^LO, V)
-            Else
-                Let child^HI = the child node corresponding to (>= t)
-                Return Classify(child^HI, V)
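-
-The same traversal in Java, as a minimal sketch with a hypothetical node
-class (not Mahout's DF classes); categorical splits are omitted for brevity:
-
-    public class ClassifySketch {
-
-      // A hypothetical decision-tree node: either a leaf with a label, or an
-      // internal node splitting on one real-valued attribute at a threshold.
-      static class Node {
-        String label;        // non-null only for leaves
-        int attribute;       // index j of the attribute the node splits on
-        double threshold;    // split threshold t for real-valued attributes
-        Node lo, hi;         // children for (< t) and (>= t)
-
-        boolean isLeaf() { return label != null; }
-      }
-
-      // Walks the tree from the root down to a leaf and returns its label.
-      static String classify(Node node, double[] v) {
-        if (node.isLeaf()) {
-          return node.label;
-        }
-        return v[node.attribute] < node.threshold
-            ? classify(node.lo, v)
-            : classify(node.hi, v);
-      }
-
-      public static void main(String[] args) {
-        // Build a tiny two-level tree by hand: split on attribute 0 at 0.5.
-        Node left = new Node();  left.label = "normal";
-        Node right = new Node(); right.label = "anomaly";
-        Node root = new Node();  root.attribute = 0; root.threshold = 0.5;
-        root.lo = left; root.hi = right;
-
-        System.out.println(classify(root, new double[] {0.3}));  // prints "normal"
-      }
-    }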
-
-<a name="RandomForests-Theoutofbag(oob)errorestimation"></a>
-### The out of bag (oob) error estimation
-
-source: [1]
-
-In random forests, there is no need for cross-validation or a separate test
-set to get an unbiased estimate of the test set error. It is estimated
-internally, during the run, as follows:
-* each tree is constructed using a different bootstrap sample from the
-original data. About one-third of the cases are left out of the bootstrap
-sample and not used in the construction of the _kth_ tree.
-* put each case left out of the construction of the _kth_ tree down the
-_kth_ tree to get a classification. In this way, a test set classification
-is obtained for each case in about one-third of the trees. At the end of
-the run, take *j* to be the class that got most of the votes every time
-case *n* was _oob_. The proportion of times that *j* is not equal to the
-true class of *n*, averaged over all cases, is the _oob error estimate_.
-This has proven to be unbiased in many tests.
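-
-A minimal sketch of how such an estimate can be tallied, assuming we already
-have each tree's predictions for the cases that were out-of-bag for it (null
-marks cases that were in that tree's bootstrap sample); this is illustrative,
-not Mahout's implementation:
-
-    import java.util.HashMap;
-    import java.util.Map;
-
-    public class OobErrorSketch {
-
-      // predictions[k][n] is tree k's predicted class for case n, or null when
-      // case n was in tree k's bootstrap sample. truth[n] is the true class.
-      static double oobError(String[][] predictions, String[] truth) {
-        int wrong = 0, counted = 0;
-        for (int n = 0; n < truth.length; n++) {
-          Map<String, Integer> votes = new HashMap<>();
-          for (String[] tree : predictions) {
-            if (tree[n] != null) {               // only trees for which n is oob vote
-              votes.merge(tree[n], 1, Integer::sum);
-            }
-          }
-          if (votes.isEmpty()) {
-            continue;                            // case n was never out-of-bag
-          }
-          String majority = votes.entrySet().stream()
-              .max(Map.Entry.comparingByValue()).get().getKey();
-          counted++;
-          if (!majority.equals(truth[n])) {
-            wrong++;
-          }
-        }
-        return counted == 0 ? 0.0 : (double) wrong / counted;
-      }
-
-      public static void main(String[] args) {
-        // Toy example: 3 trees, 4 cases.
-        String[][] predictions = {
-            {"a", null, "b", "a"},
-            {null, "a", "b", "b"},
-            {"a", "a", null, "b"},
-        };
-        String[] truth = {"a", "a", "b", "a"};
-        System.out.println("oob error estimate: " + oobError(predictions, truth));
-      }
-    }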
-
-<a name="RandomForests-OtherRFuses"></a>
-### Other RF uses
-
-source: [1]
-* variable importance
-* gini importance
-* proximities
-* scaling
-* prototypes
-* missing values replacement for the training set
-* missing values replacement for the test set
-* detecting mislabeled cases
-* detecting outliers
-* detecting novelties
-* unsupervised learning
-* balancing prediction error
-Please refer to [1] for a detailed description.
-
-<a name="RandomForests-References"></a>
-### References
-
-[1] Random Forests - Classification Description.
-    [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
-[2] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention
-    and Profitability by Using Random Forests and Regression Forests
-    Techniques," Working Papers of the Faculty of Economics and Business
-    Administration, Ghent University, Belgium 04/282.
-    Available online: [http://ideas.repec.org/p/rug/rugwps/04-282.html](http://ideas.repec.org/p/rug/rugwps/04-282.html)
-[3] Decision Trees - Andrew W. Moore.
-    [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
-[4] Information Gain - Andrew W. Moore.
-    [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
 
b/website-old/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
deleted file mode 100644
index 3b84e80..0000000
--- 
a/website-old/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
+++ /dev/null
@@ -1,49 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Restricted Boltzmann Machines
-theme:
-    name: retro-mahout
----
-
-NOTE: This implementation is a work in progress, at least until September
-2010.
-
-The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT-375)
-. 
-
-<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a>
-### Boltzmann Machines
-Boltzmann Machines are a type of stochastic neural network that closely
-resembles physical processes. They define a network of units with an overall
-energy that evolves over time, until it reaches thermal equilibrium.
-
-However, the convergence speed of Boltzmann machines that have
-unconstrained connectivity is low.
-
-<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a>
-### Restricted Boltzmann Machines
-Restricted Boltzmann Machines are a variant that is 'restricted' in the
-sense that connections between hidden units of a single layer are _not_
-allowed. In addition, stacking multiple RBMs is feasible, with the
-activities of the hidden units forming the base for a higher-level RBM. The
-combination of these two features makes RBMs well suited to
-parallelization.
-
-In the Netflix Prize, RBMs offered predictions distinctly orthogonal to
-those of SVD and k-NN approaches, and contributed immensely to the final
-solution.
-
-<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a>
-### RBM's in Apache Mahout
-An implementation of Restricted Boltzmann Machines is being developed for
-Apache Mahout as a Google Summer of Code 2010 project. A recommender
-interface will also be provided. The key aims of the implementation are:
-1. Accurate - should replicate known results, including those of the Netflix
-Prize
-1. Fast - The implementation uses Map-Reduce, hence, it should be fast
-1. Scale - Should scale to large datasets, with a design whose critical
-parts don't need a dependency between the amount of memory on your cluster
-systems and the size of your dataset
-
-You can view the patch as it develops 
[here](http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm)
-.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/classification/support-vector-machines.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/classification/support-vector-machines.md
 
b/website-old/docs/algorithms/map-reduce/classification/support-vector-machines.md
deleted file mode 100644
index 6507f78..0000000
--- 
a/website-old/docs/algorithms/map-reduce/classification/support-vector-machines.md
+++ /dev/null
@@ -1,43 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Support Vector Machines
-theme:
-    name: retro-mahout
----
-
-<a name="SupportVectorMachines-SupportVectorMachines"></a>
-# Support Vector Machines
-
-As with Naive Bayes, Support Vector Machines (or SVMs for short) can be used
-to solve the task of assigning objects to classes. However, the way this
-task is solved is completely different from the setting in Naive Bayes.
-
-Each object is considered to be a point in _n_ dimensional feature space,
-_n_ being the number of features used to describe the objects numerically.
-In addition, each object is assigned a binary label; let us assume the
-labels are "positive" and "negative". During learning, the algorithm tries
-to find a hyperplane in that space that perfectly separates positive from
-negative objects.
-It is easy to think of settings where this might very well be
-impossible. To remedy this situation, objects can be assigned so-called
-slack terms that penalize mistakes made during learning appropriately. That
-way, the algorithm is forced to find the hyperplane that causes the fewest
-mistakes.
-
-Another way to overcome the problem of there being no linear hyperplane to
-separate positive from negative objects is to project each feature
-vector into a higher-dimensional feature space and search for a linear
-separating hyperplane in that new space. Usually the main problem with
-learning in high-dimensional feature spaces is the so-called curse of
-dimensionality: there are fewer learning examples available than
-free parameters to tune. In the case of SVMs this problem is less
-detrimental, as SVMs impose additional structural constraints on their
-solutions: each separating hyperplane needs to have a maximal margin to all
-training examples. That way, the solution may be based on the
-information encoded in only very few examples.
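-
-To make the notion of a slack term concrete, here is a small illustrative
-sketch (not Mahout code) that evaluates a linear decision function w·x + b
-and its hinge loss, one common way of measuring the 'slack' of a training
-example; the weights and points are made up:
-
-    public class HingeLossSketch {
-
-      // Linear decision function f(x) = w.x + b; sign(f(x)) is the predicted label.
-      static double decision(double[] w, double b, double[] x) {
-        double s = b;
-        for (int i = 0; i < w.length; i++) {
-          s += w[i] * x[i];
-        }
-        return s;
-      }
-
-      // Hinge loss max(0, 1 - y * f(x)): zero when the example is on the correct
-      // side of the margin, positive (the "slack") when it is not.
-      static double hingeLoss(double[] w, double b, double[] x, int y) {
-        return Math.max(0.0, 1.0 - y * decision(w, b, x));
-      }
-
-      public static void main(String[] args) {
-        double[] w = {1.0, -0.5};   // hypothetical weight vector
-        double b = 0.0;
-        double[] xPos = {2.0, 1.0}; // a "positive" example (label +1)
-        double[] xNeg = {0.2, 1.0}; // a "negative" example (label -1)
-        System.out.println("slack of positive example: " + hingeLoss(w, b, xPos, +1));
-        System.out.println("slack of negative example: " + hingeLoss(w, b, xNeg, -1));
-      }
-    }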
-
-<a name="SupportVectorMachines-Strategyforparallelization"></a>
-## Strategy for parallelization
-
-<a name="SupportVectorMachines-Designofpackages"></a>
-## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/canopy-clustering.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/clustering/canopy-clustering.md 
b/website-old/docs/algorithms/map-reduce/clustering/canopy-clustering.md
deleted file mode 100644
index ac8a47c..0000000
--- a/website-old/docs/algorithms/map-reduce/clustering/canopy-clustering.md
+++ /dev/null
@@ -1,191 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Canopy Clustering
-theme:
-   name: retro-mahout
----
-
-
-**_This Document refers to the deprecated Map-Reduce Version of this 
algorithm, please see [the new implementation]({{ BASE_PATH 
}}/algorithms/clustering/canopy.html)_**
-
-<a name="CanopyClustering-CanopyClustering"></a>
-# Canopy Clustering
-
-[Canopy Clustering](http://www.kamalnigam.com/papers/canopy-kdd00.pdf)
- is a very simple, fast and surprisingly accurate method for grouping
-objects into clusters. All objects are represented as a point in a
-multidimensional feature space. The algorithm uses a fast approximate
-distance metric and two distance thresholds T1 > T2 for processing. The
-basic algorithm is to begin with a set of points and remove one at random.
-Create a Canopy containing this point and iterate through the remainder of
-the point set. At each point, if its distance from the first point is < T1,
-then add the point to the cluster. If, in addition, the distance is < T2,
-then remove the point from the set. This way points that are very close to
-the original will avoid all further processing. The algorithm loops until
-the initial set is empty, accumulating a set of Canopies, each containing
-one or more points. A given point may occur in more than one Canopy.
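-
-A minimal in-memory sketch of this basic procedure (not the MapReduce
-implementation described below), using Euclidean distance and made-up T1/T2
-thresholds; for simplicity it picks the first remaining point instead of a
-random one:
-
-    import java.util.ArrayList;
-    import java.util.Arrays;
-    import java.util.List;
-
-    public class CanopySketch {
-
-      static double distance(double[] a, double[] b) {
-        double sum = 0.0;
-        for (int i = 0; i < a.length; i++) {
-          double d = a[i] - b[i];
-          sum += d * d;
-        }
-        return Math.sqrt(sum);
-      }
-
-      // Returns one list of points per canopy. Requires T1 > T2.
-      static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
-        List<List<double[]>> result = new ArrayList<>();
-        List<double[]> remaining = new ArrayList<>(points);
-        while (!remaining.isEmpty()) {
-          double[] center = remaining.remove(0);   // pick a point (here: the first)
-          List<double[]> canopy = new ArrayList<>();
-          canopy.add(center);
-          List<double[]> next = new ArrayList<>();
-          for (double[] p : remaining) {
-            double d = distance(center, p);
-            if (d < t1) {
-              canopy.add(p);                       // within T1: add to this canopy
-            }
-            if (d >= t2) {
-              next.add(p);                         // outside T2: keep for further processing
-            }
-          }
-          remaining = next;
-          result.add(canopy);
-        }
-        return result;
-      }
-
-      public static void main(String[] args) {
-        List<double[]> points = Arrays.asList(
-            new double[] {0.0, 0.0}, new double[] {0.5, 0.5},
-            new double[] {5.0, 5.0}, new double[] {5.2, 4.8});
-        System.out.println(canopies(points, 3.0, 1.0).size() + " canopies");  // expect 2
-      }
-    }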
-
-Canopy Clustering is often used as an initial step in more rigorous
-clustering techniques, such as [K-Means Clustering](k-means-clustering.html)
-. By starting with an initial clustering the number of more expensive
-distance measurements can be significantly reduced by ignoring points
-outside of the initial canopies.
-
-**WARNING**: Canopy is deprecated in the latest release and will be removed 
once streaming k-means becomes stable enough.
- 
-<a name="CanopyClustering-Strategyforparallelization"></a>
-## Strategy for parallelization
-
-Looking at the sample Hadoop implementation in 
[http://code.google.com/p/canopy-clustering/](http://code.google.com/p/canopy-clustering/)
- the processing is done in 3 M/R steps:
-1. The data is massaged into suitable input format
-1. Each mapper performs canopy clustering on the points in its input set and
-outputs its canopies' centers
-1. The reducer clusters the canopy centers to produce the final canopy
-centers
-1. The points are then clustered into these final canopies
-
-Some ideas can be found in [Cluster computing and 
MapReduce](https://www.youtube.com/watch?v=yjPBkvYh-ss&list=PLEFAB97242917704A)
- lecture video series \[by Google(r)\]; Canopy Clustering is discussed in 
[lecture #4](https://www.youtube.com/watch?v=1ZDybXl212Q)
-. Finally here is the [Wikipedia 
page](http://en.wikipedia.org/wiki/Canopy_clustering_algorithm)
-.
-
-<a name="CanopyClustering-Designofimplementation"></a>
-## Design of implementation
-
-The implementation accepts as input Hadoop SequenceFiles containing
-multidimensional points (VectorWritable). Points may be expressed either as
-dense or sparse Vectors and processing is done in two phases: Canopy
-generation and, optionally, Clustering.
-
-<a name="CanopyClustering-Canopygenerationphase"></a>
-### Canopy generation phase
-
-During the map step, each mapper processes a subset of the total points and
-applies the chosen distance measure and thresholds to generate canopies. In
-the mapper, each point which is found to be within an existing canopy will
-be added to an internal list of Canopies. After observing all its input
-vectors, the mapper updates all of its Canopies and normalizes their totals
-to produce canopy centroids which are output, using a constant key
-("centroid") to a single reducer. The reducer receives all of the initial
-centroids and again applies the canopy measure and thresholds to produce a
-final set of canopy centroids which is output (i.e. clustering the cluster
-centroids). The reducer output format is: SequenceFile(Text, Canopy) with
-the _key_ encoding the canopy identifier. 
-
-<a name="CanopyClustering-Clusteringphase"></a>
-### Clustering phase
-
-During the clustering phase, each mapper reads the Canopies produced by the
-first phase. Since all mappers have the same canopy definitions, their
-outputs will be combined during the shuffle so that each reducer (many are
-allowed here) will see all of the points assigned to one or more canopies.
-The output format will then be: SequenceFile(IntWritable,
-WeightedVectorWritable) with the _key_ encoding the canopyId. The
-WeightedVectorWritable has two fields: a double weight and a VectorWritable
-vector. Together they encode the probability that each vector is a member
-of the given canopy.
-
-<a name="CanopyClustering-RunningCanopyClustering"></a>
-## Running Canopy Clustering
-
-The canopy clustering algorithm may be run using a command-line invocation
-on CanopyDriver.main or by making a Java call to CanopyDriver.run(...).
-Both require several arguments:
-
-Invocation using the command line takes the form:
-
-
-    bin/mahout canopy \
-        -i <input vectors directory> \
-        -o <output working directory> \
-        -dm <DistanceMeasure> \
-        -t1 <T1 threshold> \
-        -t2 <T2 threshold> \
-        -t3 <optional reducer T1 threshold> \
-        -t4 <optional reducer T2 threshold> \
-        -cf <optional cluster filter size (default: 0)> \
-        -ow <overwrite output directory if present>
-        -cl <run input vector clustering after computing Canopies>
-        -xm <execution method: sequential or mapreduce>
-
-
-Invocation using Java involves supplying the following arguments:
-
-1. input: a file path string to a directory containing the input data set a
-SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
-is not used.
-1. output: a file path string to an empty directory which is used for all
-output from the algorithm.
-1. measure: the fully-qualified class name of an instance of DistanceMeasure
-which will be used for the clustering.
-1. t1: the T1 distance threshold used for clustering.
-1. t2: the T2 distance threshold used for clustering.
-1. t3: the optional T1 distance threshold used by the reducer for
-clustering. If not specified, T1 is used by the reducer.
-1. t4: the optional T2 distance threshold used by the reducer for
-clustering. If not specified, T2 is used by the reducer.
-1. clusterFilter: the minimum size for canopies to be output by the
-algorithm. Affects both sequential and mapreduce execution modes, and
-mapper and reducer outputs.
-1. runClustering: a boolean indicating, if true, that the clustering step is
-to be executed after clusters have been determined.
-1. runSequential: a boolean indicating, if true, that the computation is to
-be run in memory using the reference Canopy implementation. Note that the
-sequential implementation performs a single pass through the input vectors
-whereas the MapReduce implementation performs two passes (once in the
-mapper and again in the reducer). As a result, the MapReduce implementation
-will typically produce fewer clusters than the sequential implementation.
-
-After running the algorithm, the output directory will contain:
-1. clusters-0: a directory containing SequenceFiles(Text, Canopy) produced
-by the algorithm. The Text _key_ contains the cluster identifier of the
-Canopy.
-1. clusteredPoints: (if runClustering enabled) a directory containing
-SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
-the canopyId. The WeightedVectorWritable _value_ is a bean containing a
-double _weight_ and a VectorWritable _vector_ where the weight indicates
-the probability that the vector is a member of the canopy. For canopy
-clustering, the weights are computed as 1/(1+distance) where the distance
-is between the cluster center and the vector using the chosen
-DistanceMeasure.
-
-<a name="CanopyClustering-Examples"></a>
-# Examples
-
-The following images illustrate Canopy clustering applied to a set of
-randomly-generated 2-d data points. The points are generated using a normal
-distribution centered at a mean location and with a constant standard
-deviation. See the README file in the 
[/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
- for details on running similar examples.
-
-The points are generated as follows:
-
-* 500 samples m=[1.0, 1.0] sd=3.0
-* 300 samples m=[1.0, 0.0] sd=0.5
-* 300 samples m=[0.0, 2.0] sd=0.1
-
-In the first image, the points are plotted and the 3-sigma boundaries of
-their generator are superimposed. 
-
-![sample data](../../images/SampleData.png)
-
-In the second image, the resulting canopies are shown superimposed upon the
-sample data. Each canopy is represented by two circles, with radius T1 and
-radius T2.
-
-![canopy](../../images/Canopy.png)
-
-The third image uses the same values of T1 and T2 but only superimposes
-canopies covering more than 10% of the population. This is a bit better
-representation of the data but it still has lots of room for improvement.
-The advantage of Canopy clustering is that it is single-pass and fast
-enough to iterate runs using different T1, T2 parameters and display
-thresholds.
-
-![canopy](../../images/Canopy10.png)
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/cluster-dumper.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/clustering/cluster-dumper.md 
b/website-old/docs/algorithms/map-reduce/clustering/cluster-dumper.md
deleted file mode 100644
index cd8cab2..0000000
--- a/website-old/docs/algorithms/map-reduce/clustering/cluster-dumper.md
+++ /dev/null
@@ -1,106 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Cluster Dumper
-theme:
-   name: retro-mahout
----
-
-<a name="ClusterDumper-Introduction"></a>
-## Cluster Dumper - Introduction
-
-Clustering tasks in Mahout will output data in the format of a SequenceFile
-(Text, Cluster) and the Text is a cluster identifier string. To analyze
-this output we need to convert the sequence files to a human readable
-format and this is achieved using the clusterdump utility.
-
-<a 
name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a>
-## Steps for analyzing cluster output using clusterdump utility
-
-After you've executed a clustering task (either an example or a real-world
-one), you can run clusterdumper in 2 modes:
-
-
-1. Hadoop Environment
-1. Standalone Java Program 
-
-
-<a name="ClusterDumper-HadoopEnvironment{anchor:HadoopEnvironment}"></a>
-### Hadoop Environment
-
-If you have set up your HADOOP_HOME environment variable, you can use the
-command line utility `mahout` to execute the ClusterDumper on Hadoop. In
-this case we won't need to copy the output clusters to our local machine.
-The utility will read the output clusters present in HDFS and write the
-human-readable cluster values to our local file system. Say you've just
-executed the [synthetic control example](clustering-of-synthetic-control-data.html)
-and want to analyze the output; you can execute the `mahout clusterdumper`
-utility from the command line.
-
-#### CLI options:
-    --help                               Print out help        
-    --input (-i) input                   The directory containing Sequence
-                                           Files for the Clusters          
-    --output (-o) output                 The output file.  If not specified,
-                                           dumps to the console.
-    --outputFormat (-of) outputFormat    The optional output format to write
-                                           the results as. Options: TEXT, CSV, 
or GRAPH_ML              
-    --substring (-b) substring           The number of chars of the        
-                                          asFormatString() to print    
-    --pointsDir (-p) pointsDir           The directory containing points  
-                                           sequence files mapping input vectors
-                                           to their cluster.  If specified, 
-                                           then the program will output the 
-                                           points associated with a cluster 
-    --dictionary (-d) dictionary         The dictionary file.
-    --dictionaryType (-dt) dictionaryType    The dictionary file type      
-                                         (text|sequencefile)
-    --distanceMeasure (-dm) distanceMeasure  The classname of the 
DistanceMeasure.
-                                               Default is SquaredEuclidean.
-    --numWords (-n) numWords             The number of top terms to print 
-    --tempDir tempDir                    Intermediate output directory
-    --startPhase startPhase              First phase to run
-    --endPhase endPhase                  Last phase to run
-    --evaluate (-e)                      Run ClusterEvaluator and 
CDbwEvaluator over the
-                                          input. The output will be appended 
to the rest of
-                                          the output at the end.   
-
-### Standalone Java Program                                          
-
-Run the clusterdump utility as follows as a standalone Java Program through 
Eclipse. <!-- - if you are using eclipse, setup mahout-utils as a project as 
specified in [Working with Maven in 
Eclipse](../../developers/buildingmahout.html). -->
-    To execute ClusterDumper.java,
-    
-* Under mahout-utils, Right-Click on ClusterDumper.java
-* Choose Run-As, Run Configurations
-* On the left menu, click on Java Application
-* On the top-bar click on "New Launch Configuration"
-* A new launch should be automatically created with project as
-
-    "mahout-utils" and Main Class as 
"org.apache.mahout.utils.clustering.ClusterDumper"
-
-In the arguments tab, specify the below arguments
-
-
-    --seqFileDir <MAHOUT_HOME>/examples/output/clusters-10 
-    --pointsDir <MAHOUT_HOME>/examples/output/clusteredPoints 
-    --output <MAHOUT_HOME>/examples/output/clusteranalyze.txt
-    replace <MAHOUT_HOME> with the actual path of your $MAHOUT_HOME
-
-* Hit run to execute the ClusterDumper using Eclipse. Setting breakpoints etc 
should just work fine.
-    
-Reading the output file
-    
-This will output the clusters into a file called clusteranalyze.txt inside 
$MAHOUT_HOME/examples/output
-Sample data will look like
-
-CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, 29.937, 29.751, 
30.054, 30.039, 30.126, 29.764, 29.835, 30.503, 29.876, 29.990, 29.605, 29.379, 
30.120, 29.882, 30.161, 29.825, 30.074, 30.001, 30.421, 29.867, 29.736, 29.760, 
30.192, 30.134, 30.082, 29.962, 29.512, 29.736, 29.594, 29.493, 29.761, 29.183, 
29.517, 29.273, 29.161, 29.215, 29.731, 29.154, 29.113, 29.348, 28.981, 29.543, 
29.192, 29.479, 29.406, 29.715, 29.344, 29.628, 29.074, 29.347, 29.812, 29.058, 
29.177, 29.063, 
29.607]
- r=[3.463, 3.351, 3.452, 3.438, 3.371, 3.569, 3.253, 3.531, 3.439, 3.472,
-3.402, 3.459, 3.320, 3.260, 3.430, 3.452, 3.320, 3.499, 3.302, 3.511,
-3.520, 3.447, 3.516, 3.485, 3.345, 3.178, 3.492, 3.434, 3.619, 3.483,
-3.651, 3.833, 3.812, 3.433, 4.133, 3.855, 4.123, 3.999, 4.467, 4.731,
-4.539, 4.956, 4.644, 4.382, 4.277, 4.918, 4.784, 4.582, 4.915, 4.607,
-4.672, 4.577, 5.035, 5.241, 4.731, 4.688, 4.685, 4.657, 4.912, 4.300] }
-
-and on...
-
-where CL-0 is Cluster 0, n=116 is the number of points observed by this
-cluster, c = [29.922 ...] is the center of the cluster as a vector, and
-r = [3.463 ...] is the radius of the cluster as a vector.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/expectation-maximization.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/clustering/expectation-maximization.md 
b/website-old/docs/algorithms/map-reduce/clustering/expectation-maximization.md
deleted file mode 100644
index e020d60..0000000
--- 
a/website-old/docs/algorithms/map-reduce/clustering/expectation-maximization.md
+++ /dev/null
@@ -1,62 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Expectation Maximization
-theme:
-   name: retro-mahout
----
-<a name="ExpectationMaximization-ExpectationMaximization"></a>
-# Expectation Maximization
-
-The principle of EM can be applied to several learning settings, but it is
-most commonly associated with clustering. The main principle of the
-algorithm is comparable to k-Means. Yet in contrast to hard cluster
-assignments, each object is given some probability of belonging to each
-cluster. Accordingly, cluster centers are recomputed based on the average of
-all objects, weighted by their probability of belonging to the cluster at
-hand.
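-
-A minimal sketch of that soft assign/recompute loop, using isotropic
-Gaussian responsibilities with a fixed variance and equal priors (a
-deliberate simplification of full EM); the data and starting centers are
-made up:
-
-    public class SoftEmSketch {
-
-      static double dist2(double[] a, double[] b) {
-        double s = 0.0;
-        for (int i = 0; i < a.length; i++) {
-          double d = a[i] - b[i];
-          s += d * d;
-        }
-        return s;
-      }
-
-      public static void main(String[] args) {
-        double[][] points = {{0.0}, {0.2}, {0.9}, {1.1}};   // hypothetical data
-        double[][] centers = {{0.1}, {1.0}};                 // hypothetical initial centers
-        double sigma2 = 0.1;   // fixed variance, equal priors: a simplification
-
-        for (int iter = 0; iter < 10; iter++) {
-          // E-step: responsibility of each center for each point (rows sum to 1).
-          double[][] resp = new double[points.length][centers.length];
-          for (int n = 0; n < points.length; n++) {
-            double norm = 0.0;
-            for (int k = 0; k < centers.length; k++) {
-              resp[n][k] = Math.exp(-dist2(points[n], centers[k]) / (2 * sigma2));
-              norm += resp[n][k];
-            }
-            for (int k = 0; k < centers.length; k++) {
-              resp[n][k] /= norm;
-            }
-          }
-          // M-step: each center becomes the responsibility-weighted mean of all points.
-          for (int k = 0; k < centers.length; k++) {
-            double weightSum = 0.0;
-            double[] sum = new double[centers[k].length];
-            for (int n = 0; n < points.length; n++) {
-              weightSum += resp[n][k];
-              for (int d = 0; d < sum.length; d++) {
-                sum[d] += resp[n][k] * points[n][d];
-              }
-            }
-            for (int d = 0; d < sum.length; d++) {
-              centers[k][d] = sum[d] / weightSum;
-            }
-          }
-        }
-        System.out.println("center 0: " + centers[0][0] + ", center 1: " + centers[1][0]);
-      }
-    }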
-
-<a name="ExpectationMaximization-Canopy-modifiedEM"></a>
-## Canopy-modified EM
-
-One can also use the canopies idea to speed up prototype-based clustering
-methods like K-means and Expectation-Maximization (EM). In general, neither
-K-means nor EM specify how many clusters to use. The canopies technique does
-not help with this choice.
-
-Prototypes (our estimates of the cluster centroids) are associated with the
-canopies that contain them, and the prototypes are only influenced by data
-that are inside their associated canopies. After creating the canopies, we
-decide how many prototypes will be created for each canopy. This could be
-done, for example, using the number of data points in a canopy and AIC or
-BIC where points that occur in more than one canopy are counted
-fractionally. Then we place prototypes into each canopy. This initial
-placement can be random, as long as it is within the canopy in question, as
-determined by the inexpensive distance metric.
-
-Then, instead of calculating the distance from each prototype to every
-point (as is traditional, an O(nk) operation), the E-step instead calculates
-the distance from each prototype to a much smaller number of points. For
-each prototype, we find the canopies that contain it (using the cheap
-distance metric), and only calculate distances (using the expensive
-distance metric) from that prototype to points within those canopies.
-
-Note that by this procedure prototypes may move across canopy boundaries
-when canopies overlap. Prototypes may move to cover the data in the
-overlapping region, and then move entirely into another canopy in order to
-cover data there.
-
-The canopy-modified EM algorithm behaves very similarly to traditional EM,
-with the slight difference that points outside the canopy have no influence
-on points in the canopy, rather than a minute influence. If the canopy
-property holds, and points in the same cluster fall in the same canopy,
-then the canopy-modified EM will almost always converge to the same maximum
-in likelihood as the traditional EM. In fact, the difference in each
-iterative step (apart from the enormous computational savings of computing
-fewer terms) will be negligible since points outside the canopy will have
-exponentially small influence.
-
-<a name="ExpectationMaximization-StrategyforParallelization"></a>
-## Strategy for Parallelization
-
-<a name="ExpectationMaximization-Map/ReduceImplementation"></a>
-## Map/Reduce Implementation
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/fuzzy-k-means.md
----------------------------------------------------------------------
diff --git a/website-old/docs/algorithms/map-reduce/clustering/fuzzy-k-means.md 
b/website-old/docs/algorithms/map-reduce/clustering/fuzzy-k-means.md
deleted file mode 100644
index ada1153..0000000
--- a/website-old/docs/algorithms/map-reduce/clustering/fuzzy-k-means.md
+++ /dev/null
@@ -1,184 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Fuzzy K-Means
-theme:
-   name: retro-mahout
----
-
-Fuzzy K-Means (also called Fuzzy C-Means) is an extension of [K-Means](http://mahout.apache.org/users/clustering/k-means-clustering.html),
-the popular simple clustering technique. While K-Means discovers hard
-clusters (a point belongs to only one cluster), Fuzzy K-Means is a more
-statistically formalized method and discovers soft clusters where a
-particular point can belong to more than one cluster with a certain
-probability.
-
-<a name="FuzzyK-Means-Algorithm"></a>
-#### Algorithm
-
-Like K-Means, Fuzzy K-Means works on objects that can be represented
-in n-dimensional vector space and for which a distance measure is defined.
-The algorithm is similar to k-means; a sketch follows the steps below.
-
-* Initialize k clusters
-* Until converged
-    * Compute the probability of a point belonging to a cluster for every <point,cluster> pair
-    * Recompute the cluster centers using the above membership probabilities of points to clusters
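-
-A minimal in-memory sketch of these two steps (not Mahout's MapReduce
-implementation), with fuzziness m = 2 and made-up points and initial
-centers; points coinciding exactly with a center are not handled, for
-brevity:
-
-    public class FuzzyKMeansSketch {
-
-      static double dist(double[] a, double[] b) {
-        double s = 0.0;
-        for (int i = 0; i < a.length; i++) {
-          double d = a[i] - b[i];
-          s += d * d;
-        }
-        return Math.sqrt(s);
-      }
-
-      public static void main(String[] args) {
-        double[][] points = {{0.0, 0.0}, {0.0, 0.4}, {3.0, 3.0}, {3.0, 2.6}};
-        double[][] centers = {{0.5, 0.5}, {2.5, 2.5}};   // hypothetical initial centers
-        double m = 2.0;                                  // fuzziness, must be > 1
-
-        for (int iter = 0; iter < 10; iter++) {
-          // Membership of point i in cluster k (each row sums to 1).
-          double[][] u = new double[points.length][centers.length];
-          for (int i = 0; i < points.length; i++) {
-            for (int k = 0; k < centers.length; k++) {
-              double dik = dist(points[i], centers[k]);
-              double sum = 0.0;
-              for (int j = 0; j < centers.length; j++) {
-                sum += Math.pow(dik / dist(points[i], centers[j]), 2.0 / (m - 1));
-              }
-              u[i][k] = 1.0 / sum;
-            }
-          }
-          // Recompute each center as the membership-weighted mean of all points.
-          for (int k = 0; k < centers.length; k++) {
-            double[] num = new double[centers[k].length];
-            double den = 0.0;
-            for (int i = 0; i < points.length; i++) {
-              double w = Math.pow(u[i][k], m);
-              den += w;
-              for (int d = 0; d < num.length; d++) {
-                num[d] += w * points[i][d];
-              }
-            }
-            for (int d = 0; d < num.length; d++) {
-              centers[k][d] = num[d] / den;
-            }
-          }
-        }
-        System.out.println("center 0: [" + centers[0][0] + ", " + centers[0][1] + "]");
-        System.out.println("center 1: [" + centers[1][0] + ", " + centers[1][1] + "]");
-      }
-    }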
-
-<a name="FuzzyK-Means-DesignImplementation"></a>
-#### Design Implementation
-
-The design is similar to the K-Means implementation in Mahout. It accepts an
-input file containing vector points. The user can either provide the cluster
-centers as input or let the canopy algorithm run first and create the
-initial clusters.
-
-Similar to K-Means, the program doesn't modify the input directories, and
-for every iteration the cluster output is stored in a directory cluster-N.
-The number of reduce tasks is set equal to the number of map tasks, so that
-many part-* output files are created in the cluster-N directory. The code
-uses driver/mapper/combiner/reducer classes as follows:
-
-FuzzyKMeansDriver - This is similar to KMeansDriver. It iterates over the
-input points and cluster points for the specified number of iterations or
-until convergence. During every iteration i, a new cluster-i directory is
-created which contains the modified cluster centers obtained during that
-FuzzyKMeans iteration. This will be fed in as the input clusters for the
-next iteration. Once Fuzzy KMeans has run for the specified number of
-iterations or has converged, a map task is run to output "the point and the
-cluster membership to each cluster" pairs as final output to a directory
-named "points".
-
-FuzzyKMeansMapper - reads the input clusters during its configure() method,
-then computes the cluster membership probability of a point to each
-cluster. Cluster membership is inversely proportional to the distance.
-Distance is computed using the user-supplied distance measure. The output
-key is the encoded clusterId. The output values are ClusterObservations
-containing observation statistics.
-
-FuzzyKMeansCombiner - receives all key:value pairs from the mapper and
-produces partial sums of the cluster membership probability times input
-vectors for each cluster. The output key is the encoded cluster identifier.
-The output values are ClusterObservations containing observation statistics.
-
-FuzzyKMeansReducer - multiple reducers receive certain keys and all values
-associated with those keys. The reducer sums the values to produce a new
-centroid for the cluster, which is output. The output key is the encoded
-cluster identifier (e.g. "C14") and the output value is the formatted
-cluster identifier (e.g. "C14"). The reducer encodes unconverged clusters
-with a 'Cn' cluster Id and converged clusters with a 'Vn' clusterId.
-
-<a name="FuzzyK-Means-RunningFuzzyk-MeansClustering"></a>
-## Running Fuzzy k-Means Clustering
-
-The Fuzzy k-Means clustering algorithm may be run using a command-line
-invocation on FuzzyKMeansDriver.main or by making a Java call to
-FuzzyKMeansDriver.run(). 
-
-Invocation using the command line takes the form:
-
-
-    bin/mahout fkmeans \
-        -i <input vectors directory> \
-        -c <input clusters directory> \
-        -o <output working directory> \
-        -dm <DistanceMeasure> \
-        -m <fuzziness argument >1> \
-        -x <maximum number of iterations> \
-        -k <optional number of initial clusters to sample from input vectors> \
-        -cd <optional convergence delta. Default is 0.5> \
-        -ow <overwrite output directory if present>
-        -cl <run input vector clustering after computing Clusters>
-        -e <emit vectors to most likely cluster during clustering>
-        -t <threshold to use for clustering if -e is false>
-        -xm <execution method: sequential or mapreduce>
-
-
-*Note:* if the -k argument is supplied, any clusters in the -c directory
-will be overwritten and -k random points will be sampled from the input
-vectors to become the initial cluster centers.
-
-Invocation using Java involves supplying the following arguments:
-
-1. input: a file path string to a directory containing the input data set a
-SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
-is not used.
-1. clustersIn: a file path string to a directory containing the initial
-clusters, a SequenceFile(key, SoftCluster | Cluster | Canopy). Fuzzy
-k-Means SoftClusters, k-Means Clusters and Canopy Canopies may be used for
-the initial clusters.
-1. output: a file path string to an empty directory which is used for all
-output from the algorithm.
-1. measure: the fully-qualified class name of an instance of DistanceMeasure
-which will be used for the clustering.
-1. convergence: a double value used to determine if the algorithm has
-converged (clusters have not moved more than the value in the last
-iteration)
-1. max-iterations: the maximum number of iterations to run, independent of
-the convergence specified
-1. m: the "fuzzyness" argument, a double > 1. For m equal to 2, this is
-equivalent to normalising the coefficient linearly to make their sum 1.
-When m is close to 1, then the cluster center closest to the point is given
-much more weight than the others, and the algorithm is similar to k-means.
-1. runClustering: a boolean indicating, if true, that the clustering step is
-to be executed after clusters have been determined.
-1. emitMostLikely: a boolean indicating, if true, that the clustering step
-should only emit the most likely cluster for each clustered point.
-1. threshold: a double indicating, if emitMostLikely is false, the cluster
-probability threshold used for emitting multiple clusters for each point. A
-value of 0 will emit all clusters with their associated probabilities for
-each vector.
-1. runSequential: a boolean indicating, if true, that the algorithm is to
-use the sequential reference implementation running in memory.
-
-After running the algorithm, the output directory will contain:
-1. clusters-N: directories containing SequenceFiles(Text, SoftCluster)
-produced by the algorithm for each iteration. The Text _key_ is a cluster
-identifier string.
-1. clusteredPoints: (if runClustering enabled) a directory containing
-SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
-the clusterId. The WeightedVectorWritable _value_ is a bean containing a
-double _weight_ and a VectorWritable _vector_ where the weights are
-computed as 1/(1+distance) where the distance is between the cluster center
-and the vector using the chosen DistanceMeasure. 
-
-<a name="FuzzyK-Means-Examples"></a>
-# Examples
-
-The following images illustrate Fuzzy k-Means clustering applied to a set
-of randomly-generated 2-d data points. The points are generated using a
-normal distribution centered at a mean location and with a constant
-standard deviation. See the README file in the 
[/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
- for details on running similar examples.
-
-The points are generated as follows:
-
-* 500 samples m=[1.0, 1.0] sd=3.0
-* 300 samples m=[1.0, 0.0] sd=0.5
-* 300 samples m=[0.0, 2.0] sd=0.1
-
-In the first image, the points are plotted and the 3-sigma boundaries of
-their generator are superimposed. 
-
-![fuzzy]({{ BASE_PATH }}/assets/img/SampleData.png)
-
-In the second image, the resulting clusters (k=3) are shown superimposed upon
-the sample data. As Fuzzy k-Means is an iterative algorithm, the centers of
-the clusters in each iteration are shown using different colors. Bold red is
-the final clustering, and previous iterations are shown in [orange, yellow,
-green, blue, violet and gray]. Although it misses a lot of the points and
-cannot capture the original, superimposed cluster centers, it does a decent
-job of clustering this data.
-
-![fuzzy]({{ BASE_PATH }}/assets/img/FuzzyKMeans.png)
-
-The third image shows the results of running Fuzzy k-Means on a different
-data set which is generated using asymmetrical standard deviations.
-Fuzzy k-Means does a fair job handling this data set as well.
-
-![fuzzy]({{ BASE_PATH }}/assets/img/2dFuzzyKMeans.png)
-
-<a name="FuzzyK-Means-References&nbsp;"></a>
-#### References&nbsp;
-
-* 
[http://en.wikipedia.org/wiki/Fuzzy_clustering](http://en.wikipedia.org/wiki/Fuzzy_clustering)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/hierarchical-clustering.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/clustering/hierarchical-clustering.md 
b/website-old/docs/algorithms/map-reduce/clustering/hierarchical-clustering.md
deleted file mode 100644
index 70a62fe..0000000
--- 
a/website-old/docs/algorithms/map-reduce/clustering/hierarchical-clustering.md
+++ /dev/null
@@ -1,15 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Hierarchical Clustering
-theme:
-   name: retro-mahout
----
-Hierarchical clustering is the process of finding bigger clusters, and also
-the smaller clusters inside the bigger ones.
-
-In Apache Mahout, separate algorithms can be used for finding clusters at
-different levels. 
-
-See [Top Down 
Clustering](https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering)
-.
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/k-means-clustering.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/clustering/k-means-clustering.md 
b/website-old/docs/algorithms/map-reduce/clustering/k-means-clustering.md
deleted file mode 100644
index 046fe0b..0000000
--- a/website-old/docs/algorithms/map-reduce/clustering/k-means-clustering.md
+++ /dev/null
@@ -1,182 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  K-Means Clustering
-theme:
-   name: retro-mahout
----
-
-# k-Means clustering - basics
-
-[k-Means](http://en.wikipedia.org/wiki/Kmeans) is a simple but well-known
-algorithm for grouping objects (clustering). All objects need to be
-represented as a set of numerical features. In addition, the user has to
-specify the number of groups (referred to as *k*) she wishes to identify.
-
-Each object can be thought of as being represented by some feature vector
-in an _n_ dimensional space, _n_ being the number of all features used to
-describe the objects to cluster. The algorithm then randomly chooses _k_
-points in that vector space; these points serve as the initial centers of
-the clusters. Afterwards all objects are each assigned to the center they
-are closest to. Usually the distance measure is chosen by the user and
-determined by the learning task.
-
-After that, for each cluster a new center is computed by averaging the
-feature vectors of all objects assigned to it. The process of assigning
-objects and recomputing centers is repeated until the process converges.
-The algorithm can be proven to converge after a finite number of
-iterations.
-
-Several tweaks concerning distance measure, initial center choice and
-computation of new average centers have been explored, as well as the
-estimation of the number of clusters _k_. Yet the main principle always
-remains the same.
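-
-A minimal in-memory sketch of this assign/recompute loop (not Mahout's
-MapReduce implementation); the points and initial centers are made up, and
-empty clusters are simply left unchanged:
-
-    import java.util.Arrays;
-
-    public class KMeansSketch {
-
-      static double dist2(double[] a, double[] b) {
-        double s = 0.0;
-        for (int i = 0; i < a.length; i++) {
-          double d = a[i] - b[i];
-          s += d * d;
-        }
-        return s;
-      }
-
-      public static void main(String[] args) {
-        double[][] points = {{0.0, 0.0}, {0.2, 0.1}, {4.0, 4.0}, {4.1, 3.9}};
-        double[][] centers = {{0.0, 0.1}, {4.0, 4.1}};   // hypothetical initial centers
-        int[] assignment = new int[points.length];
-
-        for (int iter = 0; iter < 10; iter++) {
-          // Assignment step: each point goes to its closest center.
-          for (int i = 0; i < points.length; i++) {
-            int best = 0;
-            for (int k = 1; k < centers.length; k++) {
-              if (dist2(points[i], centers[k]) < dist2(points[i], centers[best])) {
-                best = k;
-              }
-            }
-            assignment[i] = best;
-          }
-          // Update step: each center becomes the mean of the points assigned to it.
-          for (int k = 0; k < centers.length; k++) {
-            double[] sum = new double[centers[k].length];
-            int count = 0;
-            for (int i = 0; i < points.length; i++) {
-              if (assignment[i] == k) {
-                count++;
-                for (int d = 0; d < sum.length; d++) {
-                  sum[d] += points[i][d];
-                }
-              }
-            }
-            if (count > 0) {               // empty clusters left unchanged for brevity
-              for (int d = 0; d < sum.length; d++) {
-                centers[k][d] = sum[d] / count;
-              }
-            }
-          }
-        }
-        System.out.println(Arrays.toString(assignment));
-        System.out.println(Arrays.deepToString(centers));
-      }
-    }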
-
-
-
-<a name="K-MeansClustering-Quickstart"></a>
-## Quickstart
-
-[Here](https://github.com/apache/mahout/blob/master/examples/bin/cluster-reuters.sh)
- is a short shell script outline that will get you started quickly with
-k-means. This does the following:
-
-* Accepts clustering type: *kmeans*, *fuzzykmeans*, *lda*, or *streamingkmeans*
-* Gets the Reuters dataset
-* Runs org.apache.lucene.benchmark.utils.ExtractReuters to generate
-reuters-out from reuters-sgm (the downloaded archive)
-* Runs seqdirectory to convert reuters-out to SequenceFile format
-* Runs seq2sparse to convert SequenceFiles to sparse vector format
-* Runs k-means with 20 clusters
-* Runs clusterdump to show results
-
-After following the output that scrolls past, reading the code will
-offer you a better understanding.
-
-
-<a name="K-MeansClustering-Designofimplementation"></a>
-## Implementation
-
-The implementation accepts two input directories: one for the data points
-and one for the initial clusters. The data directory contains multiple
-input files of SequenceFile(Key, VectorWritable), while the clusters
-directory contains one or more SequenceFiles(Text, Cluster)
-containing _k_ initial clusters or canopies. None of the input directories
-are modified by the implementation, allowing experimentation with initial
-clustering and convergence values.
-
-Canopy clustering can be used to compute the initial clusters for k-Means:
-
-    // run the CanopyDriver job
-    CanopyDriver.runJob("testdata", "output",
-    ManhattanDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1, false);
-
-    // now run the KMeansDriver job
-    KMeansDriver.runJob("testdata", "output/clusters-0", "output",
-    EuclideanDistanceMeasure.class.getName(), "0.001", "10", true);
-
-
-In the above example, the input data points are stored in 'testdata' and
-the CanopyDriver is configured to output to the 'output/clusters-0'
-directory. Once the driver executes, that directory will contain the canopy
-definition files. Upon running the KMeansDriver, the output directory will
-contain two or more new directories: 'clusters-N', containing the clusters for
-each iteration, and 'clusteredPoints', containing the clustered data points.
-
-This diagram shows the exemplary dataflow of the k-Means example
-implementation provided by Mahout:
-<img src="../../images/Example implementation of k-Means provided with Mahout.png">
-
-
-<a name="K-MeansClustering-Runningk-MeansClustering"></a>
-## Running k-Means Clustering
-
-The k-Means clustering algorithm may be run using a command-line invocation
-on KMeansDriver.main or by making a Java call to KMeansDriver.runJob().
-
-Invocation using the command line takes the form:
-
-
-    bin/mahout kmeans \
-        -i <input vectors directory> \
-        -c <input clusters directory> \
-        -o <output working directory> \
-        -k <optional number of initial clusters to sample from input vectors> \
-        -dm <DistanceMeasure> \
-        -x <maximum number of iterations> \
-        -cd <optional convergence delta. Default is 0.5> \
-        -ow <overwrite output directory if present> \
-        -cl <run input vector clustering after computing Canopies> \
-        -xm <execution method: sequential or mapreduce>
-
-
-Note: if the \-k argument is supplied, any clusters in the \-c directory
-will be overwritten and \-k random points will be sampled from the input
-vectors to become the initial cluster centers.
-
-Invocation using Java involves supplying the following arguments:
-
-1. input: a file path string to a directory containing the input data set as
-SequenceFile(WritableComparable, VectorWritable). The sequence file _key_
-is not used.
-1. clusters: a file path string to a directory containing the initial
-clusters, a SequenceFile(key, Cluster \| Canopy). Both KMeans clusters and
-Canopy canopies may be used for the initial clusters.
-1. output: a file path string to an empty directory which is used for all
-output from the algorithm.
-1. distanceMeasure: the fully-qualified class name of an instance of
-DistanceMeasure which will be used for the clustering.
-1. convergenceDelta: a double value used to determine if the algorithm has
-converged (clusters have not moved more than the value in the last
-iteration)
-1. maxIter: the maximum number of iterations to run, independent of the
-convergence specified
-1. runClustering: a boolean indicating, if true, that the clustering step is
-to be executed after clusters have been determined.
-1. runSequential: a boolean indicating, if true, that the k-means sequential
-implementation is to be used to process the input data.
-
-After running the algorithm, the output directory will contain:
-1. clusters-N: directories containing SequenceFiles(Text, Cluster) produced
-by the algorithm for each iteration. The Text _key_ is a cluster identifier
-string.
-1. clusteredPoints: (if \--clustering enabled) a directory containing
-SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is
-the clusterId. The WeightedVectorWritable _value_ is a bean containing a
-double _weight_ and a VectorWritable _vector_ where the weight indicates
-the probability that the vector is a member of the cluster. For k-Means
-clustering, the weights are computed as 1/(1+distance) where the distance
-is between the cluster center and the vector using the chosen
-DistanceMeasure.
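-
-As a small worked example of the weight described in the last item above (a
-sketch only, with made-up coordinates and a plain Euclidean distance rather
-than a Mahout DistanceMeasure):
-
-    public class ClusterWeightSketch {
-      public static void main(String[] args) {
-        double[] center = {1.0, 1.0};            // hypothetical cluster center
-        double[] point  = {2.0, 3.0};            // hypothetical clustered point
-        double sum = 0.0;
-        for (int d = 0; d < center.length; d++) {
-          double diff = point[d] - center[d];
-          sum += diff * diff;
-        }
-        double distance = Math.sqrt(sum);        // Euclidean distance = sqrt(5) ~ 2.236
-        double weight = 1.0 / (1.0 + distance);  // ~ 0.309
-        System.out.println("weight = " + weight);
-      }
-    }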
-
-<a name="K-MeansClustering-Examples"></a>
-# Examples
-
-The following images illustrate k-Means clustering applied to a set of
-randomly-generated 2-d data points. The points are generated using a normal
-distribution centered at a mean location and with a constant standard
-deviation. See the README file in the 
[/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt)
- for details on running similar examples.
-
-The points are generated as follows:
-
-* 500 samples m=[1.0, 1.0] sd=3.0
-* 300 samples m=[1.0, 0.0] sd=0.5
-* 300 samples m=[0.0, 2.0] sd=0.1
-
-In the first image, the points are plotted and the 3-sigma boundaries of
-their generator are superimposed.
-
-![Sample data graph](../../images/SampleData.png)
-
-In the second image, the resulting clusters (k=3) are shown superimposed upon
-the sample data. As k-Means is an iterative algorithm, the centers of the
-clusters in each iteration are shown using different colors. Bold red is the
-final clustering and previous iterations are shown in orange, yellow, green,
-blue, violet and gray. Although it misses a lot of the points and cannot
-capture the original, superimposed cluster centers, it does a decent job of
-clustering this data.
-
-![kmeans](../../images/KMeans.png)
-
-The third image shows the results of running k-Means on a different dataset, 
which is generated using asymmetrical standard deviations.
-K-Means does a fair job handling this data set as well.
-
-![2d kmeans](../../images/2dKMeans.png)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/latent-dirichlet-allocation.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/clustering/latent-dirichlet-allocation.md
 
b/website-old/docs/algorithms/map-reduce/clustering/latent-dirichlet-allocation.md
deleted file mode 100644
index 01290b3..0000000
--- 
a/website-old/docs/algorithms/map-reduce/clustering/latent-dirichlet-allocation.md
+++ /dev/null
@@ -1,155 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Latent Dirichlet Allocation
-theme:
-   name: retro-mahout
----
-
-<a name="LatentDirichletAllocation-Overview"></a>
-# Overview
-
-Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
-algorithm for automatically and jointly clustering words into "topics" and
-documents into mixtures of topics. It has been successfully applied to
-model change in scientific fields over time (Griffiths and Steyvers, 2004;
-Hall, et al. 2008). 
-
-A topic model is, roughly, a hierarchical Bayesian model that associates
-with each document a probability distribution over "topics", which are in
-turn distributions over words. For instance, a topic in a collection of
-newswire might include words about "sports", such as "baseball", "home
-run", "player", and a document about steroid use in baseball might include
-"sports", "drugs", and "politics". Note that the labels "sports", "drugs",
-and "politics", are post-hoc labels assigned by a human, and that the
-algorithm itself only assigns associate words with probabilities. The task
-of parameter estimation in these models is to learn both what the topics
-are, and which documents employ them in what proportions.
-
-Another way to view a topic model is as a generalization of a mixture model
-like [Dirichlet Process 
Clustering](http://en.wikipedia.org/wiki/Dirichlet_process)
-. Starting from a normal mixture model, in which we have a single global
-mixture of several distributions, we instead say that _each_ document has
-its own mixture distribution over the globally shared mixture components.
-Operationally in Dirichlet Process Clustering, each document has its own
-latent variable drawn from a global mixture that specifies which model it
-belongs to, while in LDA each word in each document has its own parameter
-drawn from a document-wide mixture.
-
-The idea is that we use a probabilistic mixture of a number of models to
-explain some observed data. Each observed data point is assumed
-to have come from one of the models in the mixture, but we don't know
-which. The way we deal with that is to use a so-called latent parameter
-which specifies which model each data point came from.
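-
-To make this generative story concrete, here is a small, self-contained sketch
-in plain Java (not the Mahout implementation): it draws a per-document topic
-mixture from a symmetric Dirichlet with all concentrations equal to 1, then
-samples a latent topic and a word for every token. The vocabulary and the two
-topic-word distributions are made-up toy values.
-
-    import java.util.Arrays;
-    import java.util.Random;
-
-    public class LdaGenerativeSketch {
-
-      public static void main(String[] args) {
-        Random rnd = new Random(42);
-        String[] vocab = {"baseball", "player", "drugs", "senate"};
-        // Toy topic-word distributions phi[topic][word]; each row sums to 1.
-        double[][] phi = {
-            {0.5, 0.4, 0.1, 0.0},   // a "sports"-like topic
-            {0.0, 0.1, 0.5, 0.4}    // a "drugs/politics"-like topic
-        };
-        // Per-document topic mixture theta ~ Dirichlet(1, 1): with all
-        // concentrations equal to 1, normalized Exp(1) draws suffice.
-        double[] theta = new double[phi.length];
-        double sum = 0.0;
-        for (int t = 0; t < theta.length; t++) {
-          theta[t] = -Math.log(1.0 - rnd.nextDouble());
-          sum += theta[t];
-        }
-        for (int t = 0; t < theta.length; t++) {
-          theta[t] /= sum;
-        }
-        // Generate a ten-word document: pick a topic per word, then a word
-        // from that topic's distribution.
-        StringBuilder doc = new StringBuilder();
-        for (int i = 0; i < 10; i++) {
-          int z = sample(theta, rnd);          // latent topic for this token
-          int w = sample(phi[z], rnd);         // word drawn from that topic
-          doc.append(vocab[w]).append(' ');
-        }
-        System.out.println("theta = " + Arrays.toString(theta));
-        System.out.println("document: " + doc);
-      }
-
-      // Draws an index from a discrete distribution whose entries sum to 1.
-      private static int sample(double[] probs, Random rnd) {
-        double u = rnd.nextDouble();
-        double cumulative = 0.0;
-        for (int i = 0; i < probs.length; i++) {
-          cumulative += probs[i];
-          if (u < cumulative) {
-            return i;
-          }
-        }
-        return probs.length - 1;
-      }
-    }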
-
-<a name="LatentDirichletAllocation-CollapsedVariationalBayes"></a>
-# Collapsed Variational Bayes
-The CVB algorithm implemented in Mahout for LDA combines the advantages of
-both regular Variational Bayes and Gibbs Sampling. The algorithm relies on
-modeling the dependence of the parameters on the latent variables, which are
-in turn mutually independent. The algorithm uses two methodologies: one
-marginalizes out the parameters when calculating the joint distribution, and
-the other models the posterior of theta and phi given the inputs z and x.
-
-A common approach in the CVB algorithm is to compute each expectation term
-using a simple Gaussian approximation, which is accurate and requires little
-computational overhead.  The specifics behind the approximation involve
-computing the sum of the means and variances of the individual Bernoulli
-variables.
-
-CVB with the Gaussian approximation is implemented by tracking the mean and
-variance and subtracting the mean and variance of the corresponding
-Bernoulli variables.  The computational cost of the algorithm scales on
-the order of O(K) with each update to q(z(i,j)).  Also, for each
-document/word pair only one copy of the variational posterior over the
-latent variable is required.
-
-<a name="LatentDirichletAllocation-InvocationandUsage"></a>
-# Invocation and Usage
-
-Mahout's implementation of LDA operates on a collection of SparseVectors of
-word counts. These word counts should be non-negative integers, though
-things will probably work fine if you use non-negative reals. (Note
-that the probabilistic model doesn't make sense if you do!) To create these
-vectors, it's recommended that you follow the instructions in
-[Creating Vectors From Text](../basics/creating-vectors-from-text.html),
-making sure to use TF and not TFIDF as the scorer.
-
-Invocation takes the form:
-
-
-    bin/mahout cvb \
-        -i <input path for document vectors> \
-        -dict <path to term-dictionary file(s) , glob expression supported> \
-        -o <output path for topic-term distributions> \
-        -dt <output path for doc-topic distributions> \
-        -k <number of latent topics> \
-        -nt <number of unique features defined by input document vectors> \
-        -mt <path to store model state after each iteration> \
-        -maxIter <max number of iterations> \
-        -mipd <max number of iterations per doc for learning> \
-        -a <smoothing for doc topic distributions> \
-        -e <smoothing for term topic distributions> \
-        -seed <random seed> \
-        -tf <fraction of data to hold for testing> \
-        -block <number of iterations per perplexity check, ignored unless test_set_percentage > 0>
-
-
-Topic smoothing should generally be about 50/K, where K is the number of
-topics; for example, with K = 20 topics the smoothing would be about 2.5.
-The number of words in the vocabulary can be an upper bound, though
-it shouldn't be too high (for memory concerns).
-
-Choosing the number of topics is more art than science, and it's
-recommended that you try several values.
-
-After running LDA you can obtain an output of the computed topics using the
-LDAPrintTopics utility:
-
-
-    bin/mahout ldatopics \
-        -i <input vectors directory> \
-        -d <input dictionary file> \
-        -w <optional number of words to print> \
-        -o <optional output working directory. Default is to console> \
-        -h <print out help> \
-        -dt <optional dictionary type (text|sequencefile). Default is text>
-
-
-
-<a name="LatentDirichletAllocation-Example"></a>
-# Example
-
-An example is located in mahout/examples/bin/build-reuters.sh. The script
-automatically downloads the Reuters-21578 corpus, builds a Lucene index and
-converts the Lucene index to vectors. By uncommenting the last two lines
-you can then cause it to run LDA on the vectors and finally print the
-resultant topics to the console. 
-
-To adapt the example yourself, you should note that Lucene has specialized
-support for Reuters, and that building your own index will require some
-adaptation. The rest should hopefully not differ too much.
-
-<a name="LatentDirichletAllocation-ParameterEstimation"></a>
-# Parameter Estimation
-
-We use mean field variational inference to estimate the models. Variational
-inference can be thought of as a generalization of 
[EM](expectation-maximization.html)
- for hierarchical Bayesian models. The E-Step takes the form of, for each
-document, inferring the posterior probability of each topic for each word
-in each document. We then take the sufficient statistics and emit them in
-the form of (log) pseudo-counts for each word in each topic. The M-Step is
-simply to sum these together and (log) normalize them so that we have a
-distribution over the entire vocabulary of the corpus for each topic. 
-
-In implementation, the E-Step is implemented in the Map, and the M-Step is
-executed in the reduce step, with the final normalization happening as a
-post-processing step.
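-
-The M-Step normalization can be illustrated with a short, self-contained
-sketch (plain Java, not the actual Mahout reducer); it assumes the per-topic
-log pseudo-counts have already been summed into one array per topic, and
-log-normalizes each topic into a distribution over the vocabulary using the
-usual log-sum-exp trick:
-
-    public class LdaMStepSketch {
-
-      // logCounts[topic][word] holds aggregated log pseudo-counts; the result
-      // holds log p(word | topic) for each topic.
-      public static double[][] logNormalize(double[][] logCounts) {
-        int numTopics = logCounts.length;
-        int vocabSize = logCounts[0].length;
-        double[][] logTopicWord = new double[numTopics][vocabSize];
-        for (int t = 0; t < numTopics; t++) {
-          double max = Double.NEGATIVE_INFINITY;
-          for (int w = 0; w < vocabSize; w++) {
-            max = Math.max(max, logCounts[t][w]);
-          }
-          double sumExp = 0.0;                 // log-sum-exp for numerical stability
-          for (int w = 0; w < vocabSize; w++) {
-            sumExp += Math.exp(logCounts[t][w] - max);
-          }
-          double logNorm = max + Math.log(sumExp);
-          for (int w = 0; w < vocabSize; w++) {
-            logTopicWord[t][w] = logCounts[t][w] - logNorm;
-          }
-        }
-        return logTopicWord;
-      }
-    }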
-
-<a name="LatentDirichletAllocation-References"></a>
-# References
-
-[David M. Blei, Andrew Y. Ng, Michael I. Jordan, John Lafferty. 2003. Latent Dirichlet Allocation. JMLR.](http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf)
-
-[Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS.](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf)
-
-[David Hall, Dan Jurafsky, and Christopher D. Manning. 2008. Studying the History of Ideas Using Topic Models.](http://aclweb.org/anthology//D/D08/D08-1038.pdf)

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/llr---log-likelihood-ratio.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/clustering/llr---log-likelihood-ratio.md
 
b/website-old/docs/algorithms/map-reduce/clustering/llr---log-likelihood-ratio.md
deleted file mode 100644
index 300ae91..0000000
--- 
a/website-old/docs/algorithms/map-reduce/clustering/llr---log-likelihood-ratio.md
+++ /dev/null
@@ -1,46 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  LLR - Log-likelihood Ratio
-theme:
-   name: retro-mahout
----
-
-# Likelihood ratio test
-
-_The likelihood ratio test is used to compare the fit of two models, one
-of which is nested within the other._
-
-In the context of machine learning and the Mahout project in particular,
-the term LLR is usually meant to refer to a test of significance for two
-binomial distributions, also known as the G squared statistic. This is a
-special case of the multinomial test and is closely related to mutual
-information.  The value of this statistic is not normally used in this
-context as a true frequentist test of significance since there would be
-obvious and dreadful problems to do with multiple comparisons, but rather
-as a heuristic score to order pairs of items with the most interestingly
-connected items having higher scores.  In this usage, the LLR has proven
-very useful for discriminating pairs of features that have interesting
-degrees of cooccurrence and those that do not with usefully small false
-positive and false negative rates.  The LLR is typically far more suitable
-in the case of small counts than many other measures such as Pearson's
-correlation, Pearson's chi-squared statistic or z statistics.  The LLR as
-stated does not, however, make any use of rating data which can limit its
-applicability in problems such as the Netflix competition. 
-
-The actual value of the LLR is not usually very helpful other than as a way
-of ordering pairs of items.  As such, it is often used to determine a
-sparse set of coefficients to be estimated by other means such as TF-IDF. 
-Since the actual estimation of these coefficients can be done in a way that
-is independent of the training data such as by general corpus statistics,
-and since the ordering imposed by the LLR is relatively robust to counting
-fluctuation, this technique can provide very strong results in very sparse
-problems where the potential number of features vastly out-numbers the
-number of training examples and where features are highly interdependent.
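-
-For reference, the two-binomial LLR (the G-squared statistic) can be computed
-from a 2x2 cooccurrence table as in the sketch below, where k11 counts
-cooccurrences of items A and B, k12 and k21 count occurrences of one without
-the other, and k22 counts everything else; the counts in main() are made up.
-Mahout itself provides a utility for this (org.apache.mahout.math.stats.LogLikelihood).
-
-    public class LlrSketch {
-
-      // G^2 = 2 * (rowEntropy + columnEntropy - matrixEntropy), where each term
-      // is an unnormalized entropy of the row sums, column sums and cells.
-      public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
-        double rowEntropy = entropy(k11 + k12, k21 + k22);
-        double columnEntropy = entropy(k11 + k21, k12 + k22);
-        double matrixEntropy = entropy(k11, k12, k21, k22);
-        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
-      }
-
-      private static double entropy(long... counts) {
-        long total = 0;
-        double sumXLogX = 0.0;
-        for (long x : counts) {
-          total += x;
-          sumXLogX += xLogX(x);
-        }
-        return xLogX(total) - sumXLogX;
-      }
-
-      private static double xLogX(long x) {
-        return x == 0 ? 0.0 : x * Math.log(x);
-      }
-
-      public static void main(String[] args) {
-        // Hypothetical counts: 10 cooccurrences, 20 and 30 lone occurrences,
-        // and 10000 observations containing neither item.
-        System.out.println(logLikelihoodRatio(10, 20, 30, 10000));
-      }
-    }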
-
- See Also: 
-
-* [Blog post "surprise and 
coincidence"](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)
-* [G-Test](http://en.wikipedia.org/wiki/G-test)
-* [Likelihood Ratio Test](http://en.wikipedia.org/wiki/Likelihood-ratio_test)
-
-      
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/algorithms/map-reduce/clustering/spectral-clustering.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/algorithms/map-reduce/clustering/spectral-clustering.md 
b/website-old/docs/algorithms/map-reduce/clustering/spectral-clustering.md
deleted file mode 100644
index 0891609..0000000
--- a/website-old/docs/algorithms/map-reduce/clustering/spectral-clustering.md
+++ /dev/null
@@ -1,84 +0,0 @@
----
-layout: algorithm
-title: (Deprecated)  Spectral Clustering
-theme:
-   name: retro-mahout
----
-
-# Spectral Clustering Overview
-
-Spectral clustering, as its name implies, makes use of the spectrum (or 
eigenvalues) of the similarity matrix of the data. It examines the 
_connectedness_ of the data, whereas other clustering algorithms such as 
k-means use the _compactness_ to assign clusters. Consequently, in situations 
where k-means performs well, spectral clustering will also perform well. 
Additionally, there are situations in which k-means will underperform (e.g. 
concentric circles), but spectral clustering will be able to segment the 
underlying clusters. Spectral clustering is also very useful for image 
segmentation.
-
-At its simplest, spectral clustering relies on the following four steps:
-
- 1. Computing a similarity (or _affinity_) matrix `\(\mathbf{A}\)` from the 
data. This involves determining a pairwise distance function `\(f\)` that takes 
a pair of data points and returns a scalar.
-
- 2. Computing a graph Laplacian `\(\mathbf{L}\)` from the affinity matrix. 
There are several types of graph Laplacians; which one is used often depends
on the situation.
-
- 3. Computing the eigenvectors and eigenvalues of `\(\mathbf{L}\)`. The degree 
of this decomposition is often modulated by `\(k\)`, or the number of clusters. 
Put another way, `\(k\)` eigenvectors and eigenvalues are computed.
-
- 4. The `\(k\)` eigenvectors are used as "proxy" data for the original 
dataset, and fed into k-means clustering. The resulting cluster assignments are 
transparently passed back to the original data.
-
-For more theoretical background on spectral clustering, such as how affinity 
matrices are computed, the different types of graph Laplacians, and whether the 
top or bottom eigenvectors and eigenvalues are computed, please read [Ulrike 
von Luxburg's article in _Statistics and Computing_ from December 
2007](http://link.springer.com/article/10.1007/s11222-007-9033-z). It provides 
an excellent description of the linear algebra operations behind spectral 
clustering, and gives a thorough understanding of the types of situations in 
which it can be used.
-
-# Mahout Spectral Clustering
-
-As of Mahout 0.3, spectral clustering has been implemented to take advantage 
of the MapReduce framework. It uses 
[SSVD](http://mahout.apache.org/users/dim-reduction/ssvd.html) for 
dimensionality reduction of the input data set, and 
[k-means](http://mahout.apache.org/users/clustering/k-means-clustering.html) to 
perform the final clustering.
-
-**([MAHOUT-1538](https://issues.apache.org/jira/browse/MAHOUT-1538) will port 
the existing Hadoop MapReduce implementation to Mahout DSL, allowing for one of 
several distinct distributed back-ends to conduct the computation)**
-
-## Input
-
-The input format for the algorithm currently takes the form of a Hadoop-backed 
affinity matrix in the form of text files. Each line of the text file specifies 
a single element of the affinity matrix: the row index `\(i\)`, the column 
index `\(j\)`, and the value:
-
-`i, j, value`
-
-The affinity matrix is symmetric, and any unspecified `\(i, j\)` pairs are 
assumed to be 0 for sparsity. The row and column indices are 0-indexed. Thus, 
only the non-zero entries of either the upper or lower triangular need be 
specified.
-
-The matrix elements specified in the text files are collected into a Mahout 
`DistributedRowMatrix`.
-
-**([MAHOUT-1539](https://issues.apache.org/jira/browse/MAHOUT-1539) will allow 
for the creation of the affinity matrix to occur as part of the core spectral 
clustering algorithm, as opposed to the current requirement that the user 
create this matrix themselves and provide it, rather than the original data, to 
the algorithm)**
-
-## Running spectral clustering
-
-**([MAHOUT-1540](https://issues.apache.org/jira/browse/MAHOUT-1540) will 
provide a running example of this algorithm and this section will be updated to 
show how to run the example and what the expected output should be; until then, 
this section provides a how-to for simply running the algorithm on arbitrary 
input)**
-
-Spectral clustering can be invoked with the following arguments.
-
-    bin/mahout spectralkmeans \
-        -i <affinity matrix directory> \
-        -o <output working directory> \
-        -d <number of data points> \
-        -k <number of clusters AND number of top eigenvectors to use> \
-        -x <maximum number of k-means iterations>
-
-The affinity matrix can be contained in a single text file (using the 
aforementioned one-line-per-entry format) or span many text files (per 
[MAHOUT-978](https://issues.apache.org/jira/browse/MAHOUT-978), do not prefix 
text files with a leading underscore '_' or period '.'). The `-d` flag is 
required for the algorithm to know the dimensions of the affinity matrix. `-k` 
is the number of top eigenvectors from the normalized graph Laplacian in the 
SSVD step, and also the number of clusters given to k-means after the SSVD step.
-
-## Example
-
-To provide a simple example, take the following affinity matrix, contained in 
a text file called `affinity.txt`:
-
-    0, 0, 0
-    0, 1, 0.8
-    0, 2, 0.5
-    1, 0, 0.8
-    1, 1, 0
-    1, 2, 0.9
-    2, 0, 0.5
-    2, 1, 0.9
-    2, 2, 0
-
-With this 3-by-3 matrix, `-d` would be `3`. Furthermore, since all affinity 
matrices are assumed to be symmetric, the entries specifying both `1, 2, 0.9` 
and `2, 1, 0.9` are redundant; only one of these is needed. Additionally, any 
entries that are 0, such as those along the diagonal, also need not be 
specified at all. They are provided here for completeness.
-
-In general, larger values indicate a stronger "connectedness", whereas smaller 
values indicate a weaker connectedness. This will vary somewhat depending on 
the distance function used, though a common one is the [RBF 
kernel](http://en.wikipedia.org/wiki/RBF_kernel) (used in the above example) 
which returns values in the range [0, 1], where 0 indicates completely 
disconnected (or completely dissimilar) and 1 is fully connected (or identical).
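-
-As a sketch of how such an affinity file might be produced (plain Java, with
-made-up 2-d points and a made-up kernel bandwidth sigma), the following
-computes RBF affinities and prints them in the `i, j, value` format described
-above, emitting only the upper-triangular entries since the matrix is
-symmetric:
-
-    public class AffinitySketch {
-      public static void main(String[] args) {
-        double[][] points = { {0.0, 0.0}, {0.1, 0.2}, {5.0, 5.0} };  // hypothetical data
-        double sigma = 1.0;                                          // RBF bandwidth
-        for (int i = 0; i < points.length; i++) {
-          for (int j = i + 1; j < points.length; j++) {              // upper triangle only
-            double sq = 0.0;
-            for (int d = 0; d < points[i].length; d++) {
-              double diff = points[i][d] - points[j][d];
-              sq += diff * diff;
-            }
-            // RBF kernel: near 1 means strongly connected, near 0 disconnected.
-            double affinity = Math.exp(-sq / (2.0 * sigma * sigma));
-            System.out.println(i + ", " + j + ", " + affinity);
-          }
-        }
-      }
-    }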
-
-The call signature with this matrix could be as follows:
-
-    bin/mahout spectralkmeans \
-        -i s3://mahout-example/input/ \
-        -o s3://mahout-example/output/ \
-        -d 3 \
-        -k 2 \
-        -x 10
-
-There are many other optional arguments, in particular for tweaking the SSVD 
process (block size, number of power iterations, etc) and the k-means 
clustering step (distance measure, convergence delta, etc).
\ No newline at end of file
