[12/13] mahout git commit: WEBSITE Porting Old Website

rawkintrevo Sat, 29 Apr 2017 20:24:36 -0700

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/classification/partial-implementation.md
----------------------------------------------------------------------
diff --git 
a/website/docs/algorithms/map-reduce/classification/partial-implementation.md 
b/website/docs/algorithms/map-reduce/classification/partial-implementation.md
new file mode 100644
index 0000000..2a20ccb
--- /dev/null
+++ 
b/website/docs/algorithms/map-reduce/classification/partial-implementation.md
@@ -0,0 +1,146 @@
+---
+layout: default
+title: Partial Implementation
+theme:
+    name: retro-mahout
+---
+
+
+# Classifying with random forests
+
+<a name="PartialImplementation-Introduction"></a>
+# Introduction
+
+This quick start page shows how to build a decision forest using the
+partial implementation. This tutorial also explains how to use the decision
+forest to classify new data.
+Partial Decision Forests is a mapreduce implementation where each mapper
+builds a subset of the forest using only the data available in its
+partition. This allows building forests using large datasets as long as
+each partition can be loaded in-memory.
+
+<a name="PartialImplementation-Steps"></a>
+# Steps
+<a name="PartialImplementation-Downloadthedata"></a>
+## Download the data
+* The current implementation is compatible with the UCI repository file
+format. In this example we'll use the NSL-KDD dataset because it's large
+enough to show the performance of the partial implementation.
+You can download the dataset from http://nsl.cs.unb.ca/NSL-KDD/
+Download either the full training set "KDDTrain+.ARFF" or the 20%
+subset "KDDTrain+_20Percent.ARFF" (we'll use the full dataset in this
+tutorial), plus the test set "KDDTest+.ARFF".
+* Open the train and test files and remove all the lines that begin with
+'@'. All those lines are at the top of the files. Keep a copy of
+those lines somewhere, because they'll help us describe the dataset to
+Mahout.
+* Put the data in HDFS:
+
+        $HADOOP_HOME/bin/hadoop fs -mkdir testdata
+        $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+
+<a name="PartialImplementation-BuildtheJobfiles"></a>
+## Build the Job files
+* In $MAHOUT_HOME/ run: `mvn clean install -DskipTests`
+
+<a name="PartialImplementation-Generateafiledescriptorforthedataset:"></a>
+## Generate a file descriptor for the dataset: 
+Run the following command:
+
+    $HADOOP_HOME/bin/hadoop jar \
+      $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar \
+      org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff \
+      -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
+
+The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string describes all the attributes
+of the data. In this case, it means 1 numerical (N) attribute, followed by
+3 categorical (C) attributes, and so on; L indicates the label. You can also
+use 'I' to ignore some attributes.
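
To check a descriptor string before running `Describe`, it can help to expand it into one code per column. The sketch below follows the reading given above (a number repeats the attribute code that follows it). It is an illustration only, not Mahout's own parser, and `expand_descriptor` is a hypothetical helper name.

```python
def expand_descriptor(descriptor: str) -> list[str]:
    """Expand a compact descriptor such as "N 3 C L" into one code per attribute."""
    attributes = []
    repeat = 1  # a number applies only to the attribute code that follows it
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)
        else:
            attributes.extend([token] * repeat)
            repeat = 1
    return attributes


columns = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
print(len(columns), columns[:7])
```

The length of the expanded list should match the number of comma-separated columns in each data line, label included.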
+
+<a name="PartialImplementation-Runtheexample"></a>
+## Run the example
+
+
+    $HADOOP_HOME/bin/hadoop jar \
+      $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar \
+      org.apache.mahout.classifier.df.mapreduce.BuildForest \
+      -Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff \
+      -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
+
+This builds 100 trees (-t argument) using the partial implementation (-p).
+Each tree is built using 5 randomly selected attributes per node (-sl
+argument), and the example writes the decision forest to the "nsl-forest"
+directory (-o).
+The number of partitions is controlled by the -Dmapred.max.split.size
+argument, which tells Hadoop the maximum size of each partition, in this
+case 1/10 of the size of the dataset. Thus 10 partitions will be used.
+IMPORTANT: using fewer partitions should give better classification results,
+but needs a lot of memory. So if the jobs are failing, try increasing the
+number of partitions.
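
The split size for a desired number of partitions is simple arithmetic, sketched below. The file size used here is a hypothetical value chosen to be consistent with the split size in the command above, and the calculation ignores other Hadoop settings (such as the HDFS block size) that can also influence how splits are formed.

```python
import math


def max_split_size(dataset_bytes: int, partitions: int) -> int:
    """Largest split size (in bytes) that yields at most `partitions` splits."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    return math.ceil(dataset_bytes / partitions)


# Hypothetical file size, chosen to be consistent with the value in the command.
dataset_bytes = 18_742_306
print(f"-Dmapred.max.split.size={max_split_size(dataset_bytes, 10)}")
```

To use more partitions when jobs run out of memory, raise the second argument and pass the resulting value to Hadoop.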
+* The BuildForest example outputs the build time and the oob error estimate:
+
+
+    10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582
+    10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate : 0.002325895231517865
+    10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in: nsl-forest/forest.seq
+
+
+<a name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></a>
+## Using the Decision Forest to Classify new data
+Run the following command:
+
+    $HADOOP_HOME/bin/hadoop jar \
+      $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar \
+      org.apache.mahout.classifier.df.mapreduce.TestForest \
+      -i testdata/KDDTest+.arff -ds testdata/KDDTrain+.info -m nsl-forest \
+      -a -mr -o predictions
+
+This computes the predictions for the "KDDTest+.arff" dataset (-i argument)
+using the same data descriptor generated for the training dataset (-ds) and
+the decision forest built previously (-m). Optionally (if the test dataset
+contains the labels of the tuples), run the analyzer to compute the
+confusion matrix (-a); you can also store the predictions in a text
+file or a directory of text files (-o). Passing the -mr parameter uses
+Hadoop to distribute the classification.
+
+* The example should output the classification time and the confusion
+matrix:
+
+
+    10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s 355
+    10/03/13 18:08:56 INFO mapreduce.TestForest:
+    =======================================================
+    Summary
+    -------------------------------------------------------
+    Correctly Classified Instances             :      17657       78.3224%
+    Incorrectly Classified Instances   :       4887       21.6776%
+    Total Classified Instances         :      22544
+    
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+    a  b       <--Classified as
+    9459       252      |  9711        a     = normal
+    4635       8198     |  12833       b     = anomaly
+    Default Category: unknown: 2
+
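
The summary figures can be recomputed from the confusion matrix, which is a useful sanity check when comparing runs. The sketch below copies the four counts from the log above and assumes rows are actual classes and columns are predicted classes. The per-class recall is an extra metric that the log itself does not print.

```python
# Rows are actual classes, columns are predicted classes, as in the log above.
confusion = {
    "normal": {"normal": 9459, "anomaly": 252},
    "anomaly": {"normal": 4635, "anomaly": 8198},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

print(f"correct={correct} total={total} accuracy={accuracy:.4%}")
for label, row in confusion.items():
    recall = row[label] / sum(row.values())
    print(f"recall({label}) = {recall:.4f}")
```
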
+
+If the input is a single file then the output will be a single text file;
+in the above example 'predictions' would be one single file. If the input
+is a directory containing, for example, two files 'a.data' and 'b.data', then
+the output will be a directory 'predictions' containing two files
+'a.data.out' and 'b.data.out'.
+
+<a name="PartialImplementation-KnownIssuesandlimitations"></a>
+## Known Issues and limitations
+The "Decision Forest" code is still a work in progress, and many features are
+still missing. Here is a list of some known issues:
+* For now, the training does not support multiple input files. The input
+dataset must be one single file (this support will be available with the
+upcoming release).
+Classifying new data does support multiple
+input files.
+* The tree building is done when each mapper's close() method is called.
+Because the mappers don't refresh their state, the job can fail when the
+dataset is big and you try to build a large number of trees.


http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/classification/random-forests.md
----------------------------------------------------------------------
diff --git 
a/website/docs/algorithms/map-reduce/classification/random-forests.md 
b/website/docs/algorithms/map-reduce/classification/random-forests.md
new file mode 100644
index 0000000..c8b1a47
--- /dev/null
+++ b/website/docs/algorithms/map-reduce/classification/random-forests.md
@@ -0,0 +1,234 @@
+---
+layout: default
+title: Random Forests
+theme:
+    name: retro-mahout
+---
+
+<a name="RandomForests-HowtogrowaDecisionTree"></a>
+### How to grow a Decision Tree
+
+source : \[3\](3\.html)
+
+LearnUnprunedTree(*X*, *Y*)
+
+Input: *X*, a matrix of *R* rows and *M* columns, where `X_ij` is
+the value of the *j*'th attribute in the *i*'th input datapoint. Each
+column consists of either all real values or all categorical values.
+
+Input: *Y*, a vector of *R* elements, where `Y_i` is the output
+class of the *i*'th datapoint. The `Y_i` values are categorical.
+
+Output: an unpruned decision tree
+
+    If all records in X have identical values in all their attributes (this
+    includes the case where R < 2)
+        Return a Leaf Node predicting the majority output, breaking ties randomly
+    If all values in Y are the same
+        Return a Leaf Node predicting this value as the output
+    Else
+        Select m variables at random out of the M variables
+        For j = 1 .. m
+            If the j'th attribute is categorical
+                IG_j = IG(Y|X_j)     (see Information Gain)
+            Else (the j'th attribute is real-valued)
+                IG_j = IG*(Y|X_j)    (see Information Gain)
+        Let j* = argmax_j IG_j       (this is the splitting attribute we'll use)
+        If j* is categorical then
+            For each value v of the j*'th attribute
+                Let X^v = subset of rows of X in which X_ij* = v
+                Let Y^v = corresponding subset of Y
+                Let Child^v = LearnUnprunedTree(X^v, Y^v)
+            Return a decision tree node, splitting on the j*'th attribute. The
+            number of children equals the number of values of the j*'th
+            attribute, and the v'th child is Child^v
+        Else j* is real-valued and let t be the best split threshold
+            Let X^LO = subset of rows of X in which X_ij* < t
+            Let Y^LO = corresponding subset of Y
+            Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
+            Let X^HI = subset of rows of X in which X_ij* >= t
+            Let Y^HI = corresponding subset of Y
+            Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
+            Return a decision tree node, splitting on the j*'th attribute. It
+            has two children corresponding to whether the j*'th attribute is
+            below or above the given threshold.
+
+*Note*: There are alternatives to Information Gain for splitting nodes.
+
+<a name="RandomForests-Informationgain"></a>
+### Information gain
+
+source : \[3\](3\.html)
+#### Nominal attributes
+
+Suppose X can take one of m values `V_1, V_2, ..., V_m`, with
+`P(X=V_1)=p_1, P(X=V_2)=p_2, ..., P(X=V_m)=p_m`.
+
+`\[H(X) = -\sum_{j=1}^{m} p_j \log_2 p_j\]` (the entropy of X)
+
+`H(Y|X=v)` = the entropy of Y among only those records in which X has value v
+
+`\[H(Y|X) = \sum_{j} p_j \, H(Y|X=V_j)\]`
+
+`\[IG(Y|X) = H(Y) - H(Y|X)\]`
+
+#### Real-valued attributes
+
+Suppose X is real-valued.
+
+`\[IG(Y|X:t) = H(Y) - H(Y|X:t)\]`
+
+`\[H(Y|X:t) = H(Y|X<t)\,P(X<t) + H(Y|X \ge t)\,P(X \ge t)\]`
+
+`\[IG^{*}(Y|X) = \max_{t} IG(Y|X:t)\]`
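
The definitions above translate almost line for line into code. The following is a minimal sketch in plain Python, not Mahout's implementation. The function names are made up for this example, and the threshold search simply tries every distinct attribute value, which is an assumption about how the maximum over t is found.

```python
import math
from collections import Counter


def entropy(labels):
    """H(Y) = -sum_j p_j * log2(p_j), over the label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(groups):
    """H(Y|X): entropy of each group of labels, weighted by group size."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * entropy(g) for g in groups if g)


def info_gain_nominal(xs, ys):
    """IG(Y|X) = H(Y) - H(Y|X) for a categorical attribute."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return entropy(ys) - conditional_entropy(list(groups.values()))


def info_gain_real(xs, ys):
    """IG*(Y|X) = max_t IG(Y|X:t); returns (gain, threshold)."""
    best = (0.0, None)
    for t in sorted(set(xs))[1:]:  # thresholds that leave both sides non-empty
        lo = [y for x, y in zip(xs, ys) if x < t]
        hi = [y for x, y in zip(xs, ys) if x >= t]
        gain = entropy(ys) - conditional_entropy([lo, hi])
        if gain > best[0]:
            best = (gain, t)
    return best


ys = ["a", "a", "b", "b"]
print(info_gain_nominal(["u", "u", "v", "v"], ys))
print(info_gain_real([1.0, 2.0, 3.0, 4.0], ys))
```

In both toy calls the attribute separates the two classes perfectly, so the gain equals the full entropy of the labels.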
+
+<a name="RandomForests-HowtogrowaRandomForest"></a>
+### How to grow a Random Forest
+
+source : \[1\](1\.html)
+
+Each tree is grown as follows:
+1. If the number of cases in the training set is *N*, sample *N* cases at
+random, but with replacement, from the original data. This sample will be
+the training set for growing the tree.
+1. If there are *M* input variables, a number *m << M* is specified such
+that at each node, *m* variables are selected at random out of the *M* and
+the best split on these *m* is used to split the node. The value of *m* is
+held constant while the forest is grown.
+1. Each tree is grown to the largest extent possible. There is no pruning.
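
The two sources of randomness in the steps above can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, the seed is fixed purely to make the output repeatable, and the sizes (10 cases, 41 attributes, m = 5) are example values.

```python
import random


def bootstrap_sample(n_cases: int, rng: random.Random) -> list[int]:
    """Draw n_cases row indices at random, with replacement (step 1)."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]


def candidate_attributes(n_attributes: int, m: int, rng: random.Random) -> list[int]:
    """Pick m distinct attribute indices out of M for one node (step 2)."""
    return rng.sample(range(n_attributes), m)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = bootstrap_sample(10, rng)
left_out = sorted(set(range(10)) - set(rows))
print("in-bag rows:", rows)
print("left-out rows:", left_out)
print("attributes tried at one node:", candidate_attributes(41, 5, rng))
```

Because sampling is with replacement, some rows appear more than once and others not at all. The rows that are left out are the ones later used for the out-of-bag estimate.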
+
+<a name="RandomForests-RandomForestparameters"></a>
+### Random Forest parameters
+
+source : \[2\](2\.html)
+Random Forests are easy to use. The only two parameters a user of the
+technique has to determine are the number of trees to be used and the
+number of variables (*m*) to be randomly selected from the available set of
+variables.
+Breiman's recommendations are to pick a large number of trees, and to use
+the square root of the number of variables for *m*.
+
+<a name="RandomForests-Howtopredictthelabelofacase"></a>
+### How to predict the label of a case
+
+Classify(*node*, *V*)
+
+Input: *node* from the decision tree; if `node.attribute = j` then the split
+is done on the *j*'th attribute
+
+Input: *V*, a vector of *M* columns where `V_j` is the value of the *j*'th
+attribute
+
+Output: label of *V*
+
+    If node is a Leaf then
+        Return the value predicted by node
+    Else
+        Let j = node.attribute
+        If j is categorical then
+            Let v = V_j
+            Let child^v = child node corresponding to the attribute's value v
+            Return Classify(child^v, V)
+        Else j is real-valued
+            Let t = node.threshold (split threshold)
+            If V_j < t then
+                Let child^LO = child node corresponding to (< t)
+                Return Classify(child^LO, V)
+            Else
+                Let child^HI = child node corresponding to (>= t)
+                Return Classify(child^HI, V)
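
A compact way to see the Classify procedure in action is a small Python sketch. The node classes below are invented for this example and are not Mahout's classes. The walk is written as a loop rather than a recursion, which visits the same nodes in the same order.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Leaf:
    label: str


@dataclass
class CategoricalNode:
    attribute: int
    children: dict = field(default_factory=dict)  # attribute value -> child node


@dataclass
class NumericalNode:
    attribute: int
    threshold: float
    lo: "Node"  # child for values < threshold
    hi: "Node"  # child for values >= threshold


Node = Union[Leaf, CategoricalNode, NumericalNode]


def classify(node, v):
    """Walk from `node` down to a leaf and return that leaf's label."""
    while not isinstance(node, Leaf):
        value = v[node.attribute]
        if isinstance(node, CategoricalNode):
            node = node.children[value]
        else:
            node = node.lo if value < node.threshold else node.hi
    return node.label


tree = NumericalNode(
    attribute=0,
    threshold=2.5,
    lo=Leaf("normal"),
    hi=CategoricalNode(attribute=1, children={"tcp": Leaf("anomaly"), "udp": Leaf("normal")}),
)
print(classify(tree, [1.0, "tcp"]))
print(classify(tree, [3.0, "tcp"]))
```

Note that this sketch raises a `KeyError` for a categorical value that was never seen while the tree was built. How such values should be handled is a design decision the pseudocode leaves open.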
+
+<a name="RandomForests-Theoutofbag(oob)errorestimation"></a>
+### The out of bag (oob) error estimation
+
+source : \[1\](1\.html)
+
+In random forests, there is no need for cross-validation or a separate test
+set to get an unbiased estimate of the test set error. It is estimated
+internally, during the run, as follows:
+* Each tree is constructed using a different bootstrap sample from the
+original data. About one-third of the cases are left out of the bootstrap
+sample and not used in the construction of the _kth_ tree.
+* Put each case left out in the construction of the _kth_ tree down the
+_kth_ tree to get a classification. In this way, a test set classification
+is obtained for each case in about one-third of the trees. At the end of
+the run, take *j* to be the class that got most of the votes every time
+case *n* was _oob_. The proportion of times that *j* is not equal to the
+true class of *n*, averaged over all cases, is the _oob error estimate_. This
+has proven to be unbiased in many tests.
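
The voting scheme described above can be sketched as follows. The function and its inputs are hypothetical: it assumes the in-bag indices and the per-tree predictions have already been collected, and the tiny data set is hand-made for illustration. Cases that were in every bootstrap sample get no oob vote and are skipped, which is one possible way to handle them.

```python
from collections import Counter


def oob_error(in_bag, predictions, truth):
    """Out-of-bag error estimate.

    in_bag[k]      -- set of case indices used to build tree k
    predictions[k] -- predictions[k][n] is tree k's label for case n
    truth[n]       -- true label of case n
    """
    wrong = counted = 0
    for n, true_label in enumerate(truth):
        # Only trees that did NOT see case n during training may vote on it.
        votes = Counter(
            predictions[k][n] for k in range(len(in_bag)) if n not in in_bag[k]
        )
        if not votes:
            continue  # case n was in every bootstrap sample; it has no oob vote
        counted += 1
        majority = votes.most_common(1)[0][0]
        wrong += majority != true_label
    return wrong / counted if counted else float("nan")


# Tiny hand-made example: 3 trees, 4 cases (all values are illustrative).
in_bag = [{0, 1}, {1, 2}, {0, 3}]
predictions = [
    ["a", "a", "b", "b"],
    ["b", "a", "b", "b"],
    ["a", "b", "b", "b"],
]
truth = ["a", "a", "b", "b"]
print(oob_error(in_bag, predictions, truth))
```
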
+
+<a name="RandomForests-OtherRFuses"></a>
+### Other RF uses
+
+source : \[1\](1\.html)
+* variable importance
+* gini importance
+* proximities
+* scaling
+* prototypes
+* missing values replacement for the training set
+* missing values replacement for the test set
+* detecting mislabeled cases
+* detecting outliers
+* detecting novelties
+* unsupervised learning
+* balancing prediction error
+
+Please refer to \[1\](1\.html) for a detailed description.
+
+<a name="RandomForests-References"></a>
+### References
+
+\[1\](1\.html)
+Random Forests - Classification Description,
+[http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
+
+\[2\](2\.html)
+B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention
+and Profitability by Using Random Forests and Regression Forests
+Techniques," Working Papers of Faculty of
+Economics and Business Administration, Ghent University, Belgium 04/282,
+Ghent University, Faculty of Economics and Business Administration.
+Available online:
+[http://ideas.repec.org/p/rug/rugwps/04-282.html](http://ideas.repec.org/p/rug/rugwps/04-282.html)
+
+\[3\](3\.html)
+Decision Trees - Andrew W. Moore,
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
+
+\[4\](4\.html)
+Information Gain - Andrew W. Moore,
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
----------------------------------------------------------------------
diff --git 
a/website/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
 
b/website/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
new file mode 100644
index 0000000..0aa8641
--- /dev/null
+++ 
b/website/docs/algorithms/map-reduce/classification/restricted-boltzmann-machines.md
@@ -0,0 +1,49 @@
+---
+layout: default
+title: Restricted Boltzmann Machines
+theme:
+    name: retro-mahout
+---
+
+NOTE: This implementation is a Work-In-Progress, at least till September,
+2010. 
+
+The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT-375).
+
+<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a>
+### Boltzmann Machines
+Boltzmann Machines are a type of stochastic neural network that closely
+resembles physical processes. They define a network of units with an overall
+energy that evolves over a period of time until it reaches thermal
+equilibrium.
+
+However, the convergence speed of Boltzmann machines that have
+unconstrained connectivity is low.
+
+<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a>
+### Restricted Boltzmann Machines
+Restricted Boltzmann Machines are a variant that is 'restricted' in the
+sense that connections between hidden units of a single layer are _not_
+allowed. In addition, stacking multiple RBMs is also feasible, with the
+activities of the hidden units forming the base for a higher-level RBM. The
+combination of these two features makes RBMs well suited to
+parallelization.
+
+In the Netflix Prize, RBMs offered distinctly orthogonal predictions to
+SVD and k-NN approaches, and contributed immensely to the final solution.
+
+<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a>
+### RBMs in Apache Mahout
+An implementation of Restricted Boltzmann Machines is being developed for
+Apache Mahout as a Google Summer of Code 2010 project. A recommender
+interface will also be provided. The key aims of the implementation are:
+1. Accurate - should replicate known results, including those of the Netflix
+Prize
+1. Fast - the implementation uses Map-Reduce, hence it should be fast
+1. Scalable - should scale to large datasets, with a design whose critical
+parts don't require a dependency between the amount of memory on your cluster
+systems and the size of your dataset
+
+You can view the patch as it develops
+[here](http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm).

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/classification/support-vector-machines.md
----------------------------------------------------------------------
diff --git 
a/website/docs/algorithms/map-reduce/classification/support-vector-machines.md 
b/website/docs/algorithms/map-reduce/classification/support-vector-machines.md
new file mode 100644
index 0000000..6d1b9df
--- /dev/null
+++ 
b/website/docs/algorithms/map-reduce/classification/support-vector-machines.md
@@ -0,0 +1,43 @@
+---
+layout: default
+title: Support Vector Machines
+theme:
+    name: retro-mahout
+---
+
+<a name="SupportVectorMachines-SupportVectorMachines"></a>
+# Support Vector Machines
+
+As with Naive Bayes, Support Vector Machines (or SVMs for short) can be used
+to solve the task of assigning objects to classes. However, the way this
+task is solved is completely different from the approach in Naive Bayes.
+
+Each object is considered to be a point in _n_-dimensional feature space,
+_n_ being the number of features used to describe the objects numerically.
+In addition, each object is assigned a binary label; let us assume the
+labels are "positive" and "negative". During learning, the algorithm tries
+to find a hyperplane in that space that perfectly separates positive from
+negative objects.
+It is easy to think of settings where this might very well be
+impossible. To remedy this situation, objects can be assigned so-called
+slack terms that punish mistakes made during learning appropriately. That
+way, the algorithm is forced to find the hyperplane that causes the least
+number of mistakes.
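
To make the slack terms concrete, the sketch below uses the standard soft-margin formulation, where the slack of an example is how far it falls short of a margin of 1. That formula is an assumption brought in for illustration (the text above does not give one), and the hyperplane and points are made up.

```python
def slack_terms(w, b, xs, ys):
    """Slack xi_i = max(0, 1 - y_i * (w . x_i + b)) for labels y_i in {-1, +1}."""
    slacks = []
    for x, y in zip(xs, ys):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        slacks.append(max(0.0, 1.0 - y * score))
    return slacks


# Illustrative hyperplane and points (not taken from any Mahout code).
w, b = [1.0, 0.0], 0.0
xs = [[2.0, 1.0], [0.5, -1.0], [-3.0, 0.0], [1.0, 2.0]]
ys = [+1, +1, -1, -1]
print(slack_terms(w, b, xs, ys))
```

A slack of zero means the example lies on the correct side with enough margin; a value between 0 and 1 means it is inside the margin; a value above 1 means it is misclassified.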
+
+Another way to overcome the problem of there being no linear hyperplane to
+separate positive from negative objects is to simply project each feature
+vector into a higher-dimensional feature space and search for a linear
+separating hyperplane in that new space. Usually the main problem with
+learning in high-dimensional feature spaces is the so-called curse of
+dimensionality. That is, there are fewer learning examples available than
+free parameters to tune. In the case of SVMs this problem is less
+detrimental, as SVMs impose additional structural constraints on their
+solutions. Each separating hyperplane needs to have a maximal margin to all
+training examples. In addition, that way, the solution may be based on the
+information encoded in only very few examples.
+
+<a name="SupportVectorMachines-Strategyforparallelization"></a>
+## Strategy for parallelization
+
+<a name="SupportVectorMachines-Designofpackages"></a>
+## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/algorithms/map-reduce/index.md
----------------------------------------------------------------------
diff --git a/website/docs/algorithms/map-reduce/index.md 
b/website/docs/algorithms/map-reduce/index.md
new file mode 100644
index 0000000..0e55a79
--- /dev/null
+++ b/website/docs/algorithms/map-reduce/index.md
@@ -0,0 +1,42 @@
+---
+layout: page
+title: Deprecated Map Reduce Algorithms
+theme:
+    name: mahout2
+---
+
+### Classification
+
+[Bayesian](classification/bayesian.html)
+
+[Class Discovery](classification/class-discovery.html)
+
+[Classifying Your Data](classification/classifyingyourdata.html)
+
+[Collocations](classification/collocations.html)
+
+[Gaussian Discriminative 
Analysis](classification/gaussian-discriminative-analysis.html)
+
+[Hidden Markov Models](classification/hidden-markov-models.html)
+
+[Independent Component 
Analysis](classification/independent-component-analysis.html)
+
+[Locally Weighted Linear 
Regression](classification/locally-weighted-linear-regression.html)
+
+[Logistic Regression](classification/logistic-regression.html)
+
+[Mahout Collections](classification/mahout-collections.html)
+
+[Multilayer Perceptron](classification/mlp.html)
+
+[Naive Bayes](classification/naivebayes.html)
+
+[Neural Network](classification/neural-network.html)
+
+[Partial Implementation](classification/partial-implementation.html)
+
+[Random Forests](classification/random-forests.html)
+
+[Restricted Boltzmann Machines](classification/restricted-boltzmann-machines.html)
+
+[Support Vector Machines](classification/support-vector-machines.html)

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/distributed/spark-bindings.md
----------------------------------------------------------------------
diff --git a/website/docs/distributed/spark-bindings.md 
b/website/docs/distributed/spark-bindings.md
deleted file mode 100644
index 97f00c9..0000000
--- a/website/docs/distributed/spark-bindings.md
+++ /dev/null
@@ -1,101 +0,0 @@
----
-layout: page
-title: Mahout Samsara Spark Bindings
-theme:
-    name: mahout2
----
-
-# Scala & Spark Bindings:
-*Bringing algebraic semantics*
-
-## What is Scala & Spark Bindings?
-
-In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic
-optimizer for expressions like this one (actual formula from **(d)spca**)
-        
-
-`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]`
-
-bound to in-core and distributed computations (currently, on Apache Spark).
-
-
-Mahout Scala & Spark Bindings expression of the above:
-
-        val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
-
-The main idea is that a scientist writing algebraic expressions need not care about distributed
-operation plans and can work **entirely on the logical level**, just as he or she would with R.
-
-Another idea is decoupling logical expression from distributed back-end. As 
more back-ends are added, 
-this implies **"write once, run everywhere"**.
-
-The linear algebra side works with scalars, in-core vectors and matrices, and 
Mahout Distributed
-Row Matrices (DRMs).
-
-The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%,
-colSums, nrow, length operating over vectors or matrices.
-
-An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole
-and figures out how it can be simplified, and which physical operators should be picked. For example,
-there are currently about 5 different physical operators performing DRM-DRM multiplication,
-picked based on matrix geometry, distributed dataset partitioning, orientation etc.
-If we count in DRM by in-core combinations, that would be another 4, i.e. 9 total, all of it for just the
-simple x %*% y logical notation.
-
-
-
-Please refer to the documentation for details.
-
-## Status
-
-This environment addresses mostly R-like Linear Algebra optimizations for
-Spark, Flink and H2O.
-
-
-## Documentation
-
-* Scala and Spark bindings manual: 
[web](http://apache.github.io/mahout/doc/ScalaSparkBindings.html), 
[pdf](ScalaSparkBindings.pdf)
-* Overview blog on 0.10.x releases: 
[blog](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html)
-
-## Distributed methods and solvers using Bindings
-
-* In-core ([ssvd]) and Distributed ([dssvd]) Stochastic SVD -- guinea pigs -- 
see the bindings manual
-* In-core ([spca]) and Distributed ([dspca]) Stochastic PCA -- guinea pigs -- 
see the bindings manual
-* Distributed thin QR decomposition ([dqrThin]) -- guinea pig -- see the 
bindings manual 
-* [Current list of 
algorithms](https://mahout.apache.org/users/basics/algorithms.html)
-
-[ssvd]: 
https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
-[spca]: 
https://github.com/apache/mahout/blob/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala
-[dssvd]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSSVD.scala
-[dspca]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DSPCA.scala
-[dqrThin]: 
https://github.com/apache/mahout/blob/trunk/spark/src/main/scala/org/apache/mahout/sparkbindings/decompositions/DQR.scala
-
-
-## Related history of note 
-
-* CLI and Driver for Spark version of item similarity -- 
[MAHOUT-1541](https://issues.apache.org/jira/browse/MAHOUT-1541)
-* Command line interface for generalizable Spark pipelines -- 
[MAHOUT-1569](https://issues.apache.org/jira/browse/MAHOUT-1569)
-* Cooccurrence Analysis / Item-based Recommendation -- 
[MAHOUT-1464](https://issues.apache.org/jira/browse/MAHOUT-1464)
-* Spark Bindings -- 
[MAHOUT-1346](https://issues.apache.org/jira/browse/MAHOUT-1346)
-* Scala Bindings -- 
[MAHOUT-1297](https://issues.apache.org/jira/browse/MAHOUT-1297)
-* Interactive Scala & Spark Bindings Shell & Script processor -- 
[MAHOUT-1489](https://issues.apache.org/jira/browse/MAHOUT-1489)
-* OLS tutorial using Mahout shell -- 
[MAHOUT-1542](https://issues.apache.org/jira/browse/MAHOUT-1542)
-* Full abstraction of DRM apis and algorithms from a distributed engine -- 
[MAHOUT-1529](https://issues.apache.org/jira/browse/MAHOUT-1529)
-* Port Naive Bayes -- 
[MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)
-
-## Work in progress 
-* Text-delimited files for input and output -- 
[MAHOUT-1568](https://issues.apache.org/jira/browse/MAHOUT-1568)
-<!-- * Weighted (Implicit Feedback) ALS -- 
[MAHOUT-1365](https://issues.apache.org/jira/browse/MAHOUT-1365) -->
-<!--* Data frame R-like bindings -- 
[MAHOUT-1490](https://issues.apache.org/jira/browse/MAHOUT-1490) -->
-
-* *Your issue here!*
-
-<!-- ## Stuff wanted: 
-* Data frame R-like bindings (similarly to linalg bindings)
-* Stat R-like bindings (perhaps we can just adapt to commons.math stat)
-* **BYODMs:** Bring Your Own Distributed Method on SparkBindings! 
-* In-core jBlas matrix adapter
-* In-core GPU matrix adapters -->
-
-
-
-  
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx
----------------------------------------------------------------------
diff --git 
a/website/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx 
b/website/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx
new file mode 100644
index 0000000..ec1de04
Binary files /dev/null and 
b/website/docs/distributed/spark-bindings/MahoutScalaAndSparkBindings.pptx 
differ
