Author: apalumbo
Date: Thu Mar 19 21:21:28 2015
New Revision: 1667878

URL: http://svn.apache.org/r1667878
Log:
Moved Classification, clustering and Recommender directories into a new 
MapReduce directory

Added:
    mahout/site/mahout_cms/trunk/content/users/algorithms/
    mahout/site/mahout_cms/trunk/content/users/environment/
    mahout/site/mahout_cms/trunk/content/users/mapreduce/
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bankmarketing-example.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian-commandline.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/breiman-example.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/class-discovery.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/classifyingyourdata.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/hidden-markov-models.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/locally-weighted-linear-regression.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/logistic-regression.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/naivebayes.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/neural-network.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/partial-implementation.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/random-forests.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/restricted-boltzmann-machines.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/support-vector-machines.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/twenty-newsgroups.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/20newsgroups.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/canopy-clustering.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/canopy-commandline.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/cluster-dumper.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/clustering-of-synthetic-control-data.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/clustering-seinfeld-episodes.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/clusteringyourdata.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/expectation-maximization.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/fuzzy-k-means-commandline.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/fuzzy-k-means.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/hierarchical-clustering.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/k-means-clustering.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/k-means-commandline.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/latent-dirichlet-allocation.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/lda-commandline.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/llr---log-likelihood-ratio.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/spectral-clustering.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/streaming-k-means.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/viewing-result.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/viewing-results.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/visualizing-sample-clusters.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/intro-als-hadoop.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/intro-cooccurrence-spark.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/intro-itembased-hadoop.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/matrix-factorization.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/quickstart.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/recommender-documentation.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/recommender-first-timer-faq.mdtext
    mahout/site/mahout_cms/trunk/content/users/mapreduce/recommender/userbased-5-minutes.mdtext
Modified:
    mahout/site/mahout_cms/trunk/templates/standard.html

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bankmarketing-example.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bankmarketing-example.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bankmarketing-example.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bankmarketing-example.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,47 @@
+Title:
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+#Bank Marketing Example
+
+### Introduction
+
+This page describes how to run Mahout's SGD classifier on the [UCI Bank Marketing dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing).
+The goal is to predict whether the client will subscribe to a term deposit offered via a phone call. The features in the dataset consist
+of information such as age, job and marital status, as well as information about the last contacts from the bank.
+
+### Code & Data
+
+The bank marketing example code lives under 
+
+*mahout-examples/src/main/java/org.apache.mahout.classifier.sgd.bankmarketing*
+
+The data can be found at 
+
+*mahout-examples/src/main/resources/bank-full.csv*
+
+### Code details
+
+This example consists of 3 classes:
+
+  - BankMarketingClassificationMain
+  - TelephoneCall
+  - TelephoneCallParser
+
+When you run the main method of BankMarketingClassificationMain, it parses the dataset using the TelephoneCallParser and trains
+a logistic regression model with 20 runs and 20 passes. The TelephoneCallParser uses Mahout's feature vector encoder
+to encode the features in the dataset into a vector. Afterwards the model is tested, and the learning rate, AUC, and accuracy are printed to standard output.
\ No newline at end of file

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian-commandline.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian-commandline.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian-commandline.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian-commandline.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,59 @@
+Title: bayesian-commandline
+
+# Naive Bayes commandline documentation
+
+<a name="bayesian-commandline-Introduction"></a>
+## Introduction
+
+This quick start page describes how to run the naive bayesian and
+complementary naive bayesian classification algorithms on a Hadoop cluster.
+
+<a name="bayesian-commandline-Steps"></a>
+## Steps
+
+<a name="bayesian-commandline-Testingitononesinglemachinew/ocluster"></a>
+### Testing it on one single machine w/o cluster
+
+In the examples directory type:
+
+    mvn -q exec:java
+        -Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.bayes.<JOB>"
+        -Dexec.args="<OPTIONS>"
+
+    mvn -q exec:java
+        -Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.cbayes.<JOB>"
+        -Dexec.args="<OPTIONS>"
+
+
+<a name="bayesian-commandline-Runningitonthecluster"></a>
+### Running it on the cluster
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.1 release, the
+job will be mahout-core-0.1.jar
+
+* (Optional) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+
+* Run the Job:
+
+    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<MAHOUT VERSION>.job org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver <OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="bayesian-commandline-Commandlineoptions"></a>
+## Command line options
+
+    BayesDriver, BayesThetaNormalizerDriver, CBayesNormalizedWeightDriver, CBayesDriver, CBayesThetaDriver, CBayesThetaNormalizerDriver, BayesWeightSummerDriver, BayesFeatureDriver, BayesTfIdfDriver
+
+    Usage:
+        [--input <input> --output <output> --help]
+
+    Options
+
+      --input (-i) input         The Path for input Vectors. Must be a SequenceFile of Writable, Vector.
+      --output (-o) output       The directory pathname for output points.
+      --help (-h)                Print out help.
+

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/bayesian.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,144 @@
+# Naive Bayes
+
+
+## Intro
+
+Mahout currently has two Naive Bayes implementations.  The first is standard 
Multinomial Naive Bayes. The second is an implementation of Transformed 
Weight-normalized Complement Naive Bayes as introduced by Rennie et al. 
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to 
the former as Bayes and the latter as CBayes.
+
+Where Bayes has long been a standard in text classification, CBayes is an 
extension of Bayes that performs particularly well on datasets with skewed 
classes and has been shown to be competitive with algorithms of higher 
complexity such as Support Vector Machines. 
+
+
+## Implementations
+Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and 
classification can be done via a MapReduce Job or sequentially.  Mahout 
provides CLI drivers for preprocessing, training and testing. A Spark 
implementation is currently in the works 
([MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)).
+
+## Preprocessing and Algorithm
+
+As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf), Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):
+
+- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; 
`\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
+- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
+- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary; 
let `\(\alpha=\sum_i{\alpha_i}\)`. 
+- **Preprocessing** (via seq2sparse): TF-IDF transformation and L2 length normalization of `\(\vec{d}\)`
+    1. `\(d_{ij} = \sqrt{d_{ij}}\)` 
+    2. `\(d_{ij} = 
d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)` 
+    3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)` 
+- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights 
`\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)`
+    2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
+- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights 
`\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq 
c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
+    2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
+    3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
+- **Label Assignment/Testing:**
+    1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)`.
+    2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)` (a short code sketch of this rule appears below)
+
+As we can see, the main difference between Bayes and CBayes is the weight 
calculation step.  Where Bayes weighs terms more heavily based on the 
likelihood that they belong to class `\(c\)`, CBayes seeks to maximize term 
weights on the likelihood that they do not belong to any other class.  
+
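+As a concrete illustration of the label assignment rule, here is a minimal,
+self-contained Java sketch (illustrative only, not Mahout API; `weights`
+holds the trained `\(w_{ci}\)` values and `counts` the test document's term
+counts):
+
+    // returns argmax_c sum_i t_i * w_ci
+    static int classify(double[][] weights, double[] counts) {
+      int best = -1;
+      double bestScore = Double.NEGATIVE_INFINITY;
+      for (int c = 0; c < weights.length; c++) {
+        double score = 0.0;
+        for (int i = 0; i < counts.length; i++) {
+          score += counts[i] * weights[c][i];  // t_i * w_ci
+        }
+        if (score > bestScore) {
+          bestScore = score;
+          best = c;
+        }
+      }
+      return best;
+    }
+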
+## Running from the command line
+
+Mahout provides CLI drivers for all above steps.  Here we will give a simple 
overview of Mahout CLI commands used to preprocess the data, train the model 
and assign labels to the training set. An [example 
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
 is given for the full process from data acquisition through classification of 
the classic [20 Newsgroups 
corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html). 
 
+
+- **Preprocessing:**
+For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the 
[mahout 
seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html)
 command performs the TF-IDF transformations (-wt tfidf option) and L2 length 
normalization (-n 2 option) as follows:
+
+        mahout seq2sparse 
+          -i ${PATH_TO_SEQUENCE_FILES} 
+          -o ${PATH_TO_TFIDF_VECTORS} 
+          -nv 
+          -n 2
+          -wt tfidf
+
+- **Training:**
+The model is then trained using `mahout trainnb` .  The default is to train a 
Bayes model. The -c option is given to train a CBayes model:
+
+        mahout trainnb
+          -i ${PATH_TO_TFIDF_VECTORS} 
+          -el 
+          -o ${PATH_TO_MODEL}/model 
+          -li ${PATH_TO_MODEL}/labelindex 
+          -ow 
+          -c
+
+- **Label Assignment/Testing:**
+Classification and testing on a holdout set can then be performed via `mahout 
testnb`. Again, the -c option indicates that the model is CBayes.  The -seq 
option tells `mahout testnb` to run sequentially:
+
+        mahout testnb 
+          -i ${PATH_TO_TFIDF_TEST_VECTORS}
+          -m ${PATH_TO_MODEL}/model 
+          -l ${PATH_TO_MODEL}/labelindex 
+          -ow 
+          -o ${PATH_TO_OUTPUT} 
+          -c 
+          -seq
+
+## Command line options
+
+- **Preprocessing:**
+  
+  Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes. For a full list of `mahout seq2sparse` options see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page.
+
+        mahout seq2sparse
+          --output (-o) output             The directory pathname for output.
+          --input (-i) input               Path to job input directory.
+          --weight (-wt) weight            The kind of weight to use. Currently TF
+                                               or TFIDF. Default: TFIDF
+          --norm (-n) norm                 The norm to use, expressed as either a
+                                               float or "INF" if you want to use the
+                                               Infinite norm. Must be greater or equal
+                                               to 0. The default is not to normalize
+          --overwrite (-ow)                If set, overwrite the output directory
+          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
+                                               be SequentialAccessVectors. If set true
+                                               else false
+          --namedVector (-nv)              (Optional) Whether output vectors should
+                                               be NamedVectors. If set true else false
+
+- **Training:**
+
+        mahout trainnb
+          --input (-i) input               Path to job input directory.
+          --output (-o) output             The directory pathname for output.
+          --labels (-l) labels             Comma-separated list of labels to include in
+                                               training
+          --extractLabels (-el)            Extract the labels from the input
+          --alphaI (-a) alphaI             Smoothing parameter. Default is 1.0
+          --trainComplementary (-c)        Train complementary? Default is false.
+          --labelIndex (-li) labelIndex    The path to store the label index in
+          --overwrite (-ow)                If present, overwrite the output directory
+                                               before running job
+          --help (-h)                      Print out help
+          --tempDir tempDir                Intermediate output directory
+          --startPhase startPhase          First phase to run
+          --endPhase endPhase              Last phase to run
+
+- **Testing:**
+
+        mahout testnb
+          --input (-i) input               Path to job input directory.
+          --output (-o) output             The directory pathname for output.
+          --overwrite (-ow)                If present, overwrite the output directory
+                                               before running job
+          --model (-m) model               The path to the model built during training
+          --testComplementary (-c)         Test complementary? Default is false.
+          --runSequential (-seq)           Run sequential?
+          --labelIndex (-l) labelIndex     The path to the location of the label index
+          --help (-h)                      Print out help
+          --tempDir tempDir                Intermediate output directory
+          --startPhase startPhase          First phase to run
+          --endPhase endPhase              Last phase to run
+
+
+## Examples
+
+Mahout provides an example for Naive Bayes classification:
+
+1. [Classify 20 Newsgroups](twenty-newsgroups.html)
+ 
+## References
+
+[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003). [Tackling the Poor Assumptions of Naive Bayes Text Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).
+
+

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/breiman-example.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/breiman-example.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/breiman-example.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/breiman-example.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,62 @@
+Title: Breiman Example
+
+#Breiman Example
+
+#### Introduction
+
+This page describes how to run the Breiman example, which implements the test procedure described in [Leo Breiman's paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3999&rep=rep1&type=pdf). The basic algorithm is as follows (a compact Java sketch follows the list):
+
+ * repeat *I* iterations
+ * in each iteration do
+  * keep 10% of the dataset apart as a testing set
+  * build two forests using the training set, one with *m = int(log2(M) + 1)* (called Random-Input) and one with *m = 1* (called Single-Input)
+  * choose the forest that gave the lowest oob error estimation to compute the test set error
+  * compute the test set error using the Single Input Forest (test error); this demonstrates that even with *m = 1*, Decision Forests give comparable results to greater values of *m*
+  * compute the mean testset error using every tree of the chosen forest (tree error); this should indicate how well a single Decision Tree performs
+ * compute the mean test error for all iterations
+ * compute the mean tree error for all iterations
+
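+A compact, self-contained Java sketch of this evaluation loop (buildForest,
+oobError and testError are hypothetical stand-ins, not the actual
+BreimanExample API):
+
+    // abstract sketch: pick the forest with the lower oob error each
+    // iteration and accumulate its error on the held-out 10% test split
+    interface Forest {}
+
+    abstract class BreimanSketch {
+      abstract Forest buildForest(double[][] train, int m); // m = attributes tried per split
+      abstract double oobError(Forest f);
+      abstract double testError(Forest f, double[][] test);
+
+      double meanTestError(double[][][] trainSplits, double[][][] testSplits, int M) {
+        double sum = 0.0;
+        for (int i = 0; i < trainSplits.length; i++) {
+          Forest randomInput = buildForest(trainSplits[i], (int) (Math.log(M) / Math.log(2)) + 1);
+          Forest singleInput = buildForest(trainSplits[i], 1);
+          Forest best = oobError(randomInput) <= oobError(singleInput) ? randomInput : singleInput;
+          sum += testError(best, testSplits[i]);
+        }
+        return sum / trainSplits.length;   // mean test error over all iterations
+      }
+    }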
+
+#### Running the Example
+
+The current implementation is compatible with the [UCI 
repository](http://archive.ics.uci.edu/ml/) file format. We'll show how to run 
this example on two datasets:
+
+First, we deal with [Glass 
Identification](http://archive.ics.uci.edu/ml/datasets/Glass+Identification): 
download the 
[dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data)
 file called **glass.data** and store it onto your local machine. Next, we must 
generate the descriptor file **glass.info** for this dataset with the following 
command:
+
+    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/glass.data -f /path/to/glass.info -d I 9 N L
+
+Substitute */path/to/* with the folder where you downloaded the dataset. The
+argument "I 9 N L" indicates the nature of the variables. Here it means 1
+ignored (I) attribute, followed by 9 numerical (N) attributes, followed by
+the label (L).
+
+Finally, we build and evaluate our random forest classifier as follows:
+
+    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/glass.data -ds /path/to/glass.info -i 10 -t 100
+
+which builds 100 trees (-t argument) and repeats the test for 10 iterations (-i argument).
+
+The example outputs the following results:
+
+ * Selection error: mean test error for the selected forest on all iterations
+ * Single Input error: mean test error for the single input forest on all iterations
+ * One Tree error: mean single tree error on all iterations
+ * Mean Random Input Time: mean build time for random input forests on all iterations
+ * Mean Single Input Time: mean build time for single input forests on all iterations
+
+We can repeat this for a [Sonar](http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29) use case: download the [dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data) file called **sonar.all-data** and store it onto your local machine. Generate the descriptor file **sonar.info** for this dataset with the following command:
+
+    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L
+
+The argument "60 N L" means 60 numerical (N) attributes, followed by the label
+(L). Analogous to the previous case, we run the evaluation as follows:
+
+    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100
+
+
+

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/class-discovery.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/class-discovery.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/class-discovery.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/class-discovery.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,150 @@
+Title: Class Discovery
+<a name="ClassDiscovery-ClassDiscovery"></a>
+# Class Discovery
+
+See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf
+
+CDGA uses a Genetic Algorithm to discover a classification rule for a given
+dataset. 
+A dataset can be seen as a table:
+
+<table>
+<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr>
+<tr><td>row 1</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+<tr><td>row 2</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
+<tr><td>row M</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+</table>
+
+An attribute can be numerical, for example a "temperature" attribute, or
+categorical, for example a "color" attribute. For classification purposes,
+one of the categorical attributes is designated as a *label*, which means
+that its value defines the *class* of the rows.
+A classification rule can be represented as follows:
+<table>
+<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr>
+<tr><td>weight</td><td>w1</td><td>w2</td><td>...</td><td>wN</td></tr>
+<tr><td>operator</td><td>op1</td><td>op2</td><td>...</td><td>opN</td></tr>
+<tr><td>value</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+</table>
+
+For a given *target* class and a weight *threshold*, the classification
+rule can be read as follows:
+
+    for each row of the dataset
+      if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 rule.value1)) &&
+         (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 rule.value2)) &&
+         ...
+         (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN rule.valueN)) then
+        row is part of the target class
+
+
+*Important:* The label attribute is not evaluated by the rule.
+
+The threshold parameter allows some conditions of the rule to be skipped if
+their weight is too small. The operators available depend on the attribute
+types:
+* for numerical attributes, the available operators are '<' and '>='
+* for categorical attributes, the available operators are '!=' and '=='
+
+The "threshold" and "target" are user defined parameters, and because the
+label is always a categorical attribute, the target is the (zero based)
+index of the class label value in all the possible values of the label. For
+example, if the label attribute can have the following values (blue, brown,
+green), then a target of 1 means the "brown" class.
+
+For example, we have the following dataset (the label attribute is "Eyes
+Color"):
+
+<table>
+<tr><th> </th><th>Age</th><th>Eyes Color</th><th>Hair Color</th></tr>
+<tr><td>row 1</td><td>16</td><td>brown</td><td>dark</td></tr>
+<tr><td>row 2</td><td>25</td><td>green</td><td>light</td></tr>
+<tr><td>row 3</td><td>12</td><td>blue</td><td>light</td></tr>
+</table>
+
+and a classification rule:
+
+<table>
+<tr><td>weight</td><td>0</td><td>1</td></tr>
+<tr><td>operator</td><td>&lt;</td><td>!=</td></tr>
+<tr><td>value</td><td>20</td><td>light</td></tr>
+</table>
+
+and the following parameters: threshold = 1 and target = 0 (brown).
+
+This rule can be read as follows:
+
+    for each row of the dataset
+      if (0 < 1 || (0 >= 1 && row.value1 < 20)) &&
+         (1 < 1 || (1 >= 1 && row.value2 != light)) then
+        row is part of the "brown Eye Color" class
+
+
+Please note how the rule skipped the label attribute (Eye Color), and how
+the first condition is ignored because its weight is < threshold.
+
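+The rule evaluation above can also be written compactly in Java. This is an
+illustrative, self-contained sketch for numerical attributes only (not part
+of the CDGA API); `w`, `op` and `val` hold the rule's weights, operators and
+values:
+
+    // a row matches when every sufficiently weighted condition holds;
+    // conditions with w[i] < threshold are skipped, as described above
+    static boolean matches(double[] row, double[] w, char[] op, double[] val, double threshold) {
+      for (int i = 0; i < row.length; i++) {
+        if (w[i] < threshold) {
+          continue;  // condition skipped: weight too small
+        }
+        boolean cond = (op[i] == '<') ? (row[i] < val[i]) : (row[i] >= val[i]);
+        if (!cond) {
+          return false;
+        }
+      }
+      return true;  // row is part of the target class
+    }
+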
+<a name="ClassDiscovery-Runningtheexample:"></a>
+# Running the example:
+NOTE: Substitute in the appropriate version for the Mahout JOB jar
+
+1. cd <MAHOUT_HOME>/examples
+1. ant job
+1. <HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc wdbc
+1. <HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc.infos wdbc.infos
+1. <HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.CDGA <MAHOUT_HOME>/examples/src/test/resources/wdbc 1 0.9 1 0.033 0.1 0 100 10
+
+CDGA needs 9 parameters:
+
+* param 1 : path of the directory that contains the dataset and its infos file
+* param 2 : target class
+* param 3 : threshold
+* param 4 : number of crossover points for the multi-point crossover
+* param 5 : mutation rate
+* param 6 : mutation range
+* param 7 : mutation precision
+* param 8 : population size
+* param 9 : number of generations before the program stops
+
+For more information about the 4th parameter, please see [Multi-point Crossover](http://www.geatbx.com/docu/algindex-03.html#P616_36571).
+For a detailed explanation of the 5th, 6th and 7th parameters, please see [Real Valued Mutation](http://www.geatbx.com/docu/algindex-04.html#P659_42386).
+
+*TODO*: Fill in where to find the output and what it means.
+
+<a name="ClassDiscovery-Theinfofile"></a>
+# The info file
+
+To run properly, CDGA needs some information about the dataset. Each
+dataset should be accompanied by an .infos file that contains the needed
+information. For each attribute, a corresponding line in the info file
+describes it; it can be one of the following:
+
+* IGNORED - the attribute is ignored
+* LABEL, val1, val2,... - the attribute is the label (class), followed by its possible values
+* CATEGORICAL, val1, val2,... - the attribute is categorical (nominal), followed by its possible values
+* NUMERICAL, min, max - the attribute is numerical, followed by its min and max values
+
+This file can be generated automatically using a special tool available with CDGA:
+
+* The tool searches for an existing infos file (*must be filled by the
+user*), in the same directory as the dataset, with the same name and the
+".infos" extension, containing the type of the attributes:
+    * 'N' numerical attribute
+    * 'C' categorical attribute
+    * 'L' label (this is also a categorical attribute)
+    * 'I' ignore the attribute
+
+    Each attribute is on a separate line.
+* A Hadoop job is used to parse the dataset and collect the information.
+This means that *the dataset can be distributed over HDFS*.
+* The results are written back to the same .infos file, with the correct
+format needed by CDGA.

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/classifyingyourdata.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/classifyingyourdata.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/classifyingyourdata.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/classifyingyourdata.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,22 @@
+Title: ClassifyingYourData
+
+# Classifying data from the command line
+
+
+After you've done the [Quickstart](../basics/quickstart.html) and are familiar with the basics of Mahout, it is time to build a
+classifier from your own data. The following pieces *may* be useful in getting started:
+
+<a name="ClassifyingYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format: See 
[Creating Vectors](../basics/creating-vectors.html) as well as [Creating 
Vectors from Text](../basics/creating-vectors-from-text.html).
+
+<a name="ClassifyingYourData-RunningtheProcess"></a>
+# Running the Process
+
+* Logistic regression [background](logistic-regression.html)
+* [Naive Bayes background](naivebayes.html) and [commandline](bayesian-commandline.html) options.
+* [Complementary naive bayes background](complementary-naive-bayes.html), [design](https://issues.apache.org/jira/browse/mahout-60.html), and [c-bayes-commandline](c-bayes-commandline.html)
+* [Random Forests Classification](https://cwiki.apache.org/confluence/display/MAHOUT/Random+Forests) comes with a [Breiman example](breiman-example.html). There is some really great documentation over at [Mark Needham's blog](http://www.markhneedham.com/blog/2012/10/27/kaggle-digit-recognizer-mahout-random-forest-attempt/). Also check out the description on [Xiaomeng Shawn Wan's blog](http://shawnwan.wordpress.com/2012/06/01/mahout-0-7-random-forest-examples/).
\ No newline at end of file

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/hidden-markov-models.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/hidden-markov-models.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/hidden-markov-models.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/hidden-markov-models.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,97 @@
+Title: Hidden Markov Models
+
+# Hidden Markov Models
+
+<a name="HiddenMarkovModels-IntroductionandUsage"></a>
+## Introduction and Usage
+
+Hidden Markov Models are used in multiple areas of Machine Learning, such
+as speech recognition, handwritten letter recognition or natural language
+processing. 
+
+<a name="HiddenMarkovModels-FormalDefinition"></a>
+## Formal Definition
+
+A Hidden Markov Model (HMM) is a statistical model of a process consisting
+of two (in our case discrete) random variables O and Y, which change their
+state sequentially. The variable Y with states \{y_1, ... , y_n\} is called
+the "hidden variable", since its state is not directly observable. The
+state of Y changes sequentially with a so-called - in our case first-order
+- Markov Property. This means that the state change probability of Y only
+depends on its current state and does not change in time. Formally we
+write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)).
+The variable O with states \{o_1, ... , o_m\} is called the "observable
+variable", since its state can be directly observed. O does not have a
+Markov Property, but its state probability depends statically on the
+current state of Y.
+
+Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number
+of hidden states, m is the number of observable states, P is an n-dimensional
+vector containing the initial hidden state probabilities, A is the
+nxn-dimensional "transition matrix" containing the transition probabilities
+such that A[i,j] = P(Y(t)=y_i|Y(t-1)=y_j), and B is the mxn-dimensional
+"emission matrix" containing the observation probabilities such that
+B[i,j] = P(O=o_i|Y=y_j).
+
+<a name="HiddenMarkovModels-Problems"></a>
+## Problems
+
+Rabiner [1] defined three main problems for HMM models:
+
+1. Evaluation: Given a sequence O of observations and a model M, what is
+the probability P(O|M) that sequence O was generated by model M. The
+Evaluation problem can be efficiently solved using the Forward algorithm.
+2. Decoding: Given a sequence O of observations and a model M, what is
+the most likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to
+generate this sequence. The Decoding problem can be efficiently solved
+using the Viterbi algorithm.
+3. Learning: Given a sequence O of observations, what is the most likely
+model M*=argmax(M)P(O|M) to generate this sequence. The Learning problem
+can be efficiently solved using the Baum-Welch algorithm.
+
+<a name="HiddenMarkovModels-Example"></a>
+## Example
+
+To build a Hidden Markov Model and use it to build some predictions, try a 
simple example like this:
+
+Create an input file to train the model.  Here we have a sequence drawn from 
the set of states 0, 1, 2, and 3, separated by space characters.
+
+    $ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" > hmm-input
+
+Now run the baumwelch job to train your model, after first setting 
MAHOUT_LOCAL to true, to use your local file system.
+
+    $ export MAHOUT_LOCAL=true
+    $ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000
+
+Output like the following should appear in the console.
+
+    Initial probabilities: 
+    0 1 2 
+    1.0 0.0 3.5659361683006626E-251 
+    Transition matrix:
+      0 1 2 
+    0 6.098919959130616E-5 0.9997275322964165 2.1147850399214744E-4 
+    1 7.404648706054873E-37 0.9086408633885092 0.09135913661149081 
+    2 0.2284374545687356 7.01786289571088E-11 0.7715625453610858 
+    Emission matrix: 
+      0 1 2 3 
+    0 0.9999997858591223 2.0536163836449762E-39 2.1414087769942127E-7 1.052441093535389E-27
+    1 7.495656581383351E-34 0.2241269055449904 0.4510889999455847 0.32478409450942497
+    2 0.815051477991782 0.18494852200821799 8.465660634827592E-33 2.8603899591778015E-36
+    14/03/22 09:52:21 INFO driver.MahoutDriver: Program took 180 ms (Minutes: 0.003)
+
+The model trained with the input set now is in the file 'hmm-model', which we 
can use to build a predicted sequence.
+
+    $ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10
+
+To see the predictions:
+
+    $ cat hmm-predictions 
+    0 1 3 3 2 2 2 2 1 2
+
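+The same workflow can also be driven from Java. The following is a rough
+sketch assuming Mahout's org.apache.mahout.classifier.sequencelearning.hmm
+API (HmmModel, HmmTrainer, HmmEvaluator); exact signatures may differ
+between versions:
+
+    import org.apache.mahout.classifier.sequencelearning.hmm.HmmEvaluator;
+    import org.apache.mahout.classifier.sequencelearning.hmm.HmmModel;
+    import org.apache.mahout.classifier.sequencelearning.hmm.HmmTrainer;
+
+    // a shortened version of the observed training sequence above
+    int[] observations = {0, 1, 2, 2, 2, 1, 1, 0, 0, 3, 3, 3, 2, 1, 2, 1};
+
+    // randomly initialized model: 3 hidden states, 4 observable states
+    HmmModel initial = new HmmModel(3, 4);
+
+    // Baum-Welch training, mirroring the CLI flags -e .0001 -m 1000
+    HmmModel trained = HmmTrainer.trainBaumWelch(initial, observations, 0.0001, 1000, true);
+
+    // generate a predicted sequence of length 10, as hmmpredict -l 10 does
+    int[] predicted = HmmEvaluator.predict(trained, 10);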
+
+<a name="HiddenMarkovModels-Resources"></a>
+## Resources
+
+[1] Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models
+and selected applications in speech recognition". Proceedings of the IEEE
+77 (2): 257-286. doi:10.1109/5.18626.
\ No newline at end of file

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/locally-weighted-linear-regression.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/locally-weighted-linear-regression.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/locally-weighted-linear-regression.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/locally-weighted-linear-regression.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,20 @@
+Title: Locally Weighted Linear Regression
+
+<a name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a>
+# Locally Weighted Linear Regression
+
+Model-based methods, such as SVM, Naive Bayes and the mixture of Gaussians,
+use the data to build a parameterized model. After training, the model is
+used for predictions and the data are generally discarded. In contrast,
+"memory-based" methods are non-parametric approaches that explicitly retain
+the training data, and use it each time a prediction needs to be made.
+Locally weighted regression (LWR) is a memory-based method that performs a
+regression around a point of interest using only training data that are
+"local" to that point. Source:
+http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html
+
+<a name="LocallyWeightedLinearRegression-Strategyforparallelregression"></a>
+## Strategy for parallel regression
+
+<a name="LocallyWeightedLinearRegression-Designofpackages"></a>
+## Design of packages

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/logistic-regression.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/logistic-regression.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/logistic-regression.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/logistic-regression.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,106 @@
+Title: Logistic Regression
+
+<a name="LogisticRegression-LogisticRegression(SGD)"></a>
+# Logistic Regression (SGD)
+
+Logistic regression is a model used for prediction of the probability of
+occurrence of an event. It makes use of several predictor variables that
+may be either numerical or categorical.
+
+Logistic regression is the standard industry workhorse that underlies many
+production fraud detection and advertising quality and targeting products. 
+The Mahout implementation uses Stochastic Gradient Descent (SGD), which
+allows large training sets to be used.
+
+For a more detailed analysis of the approach, have a look at the [thesis of
+Paul 
Komarek](http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en).
+
+See MAHOUT-228 for the main JIRA issue for SGD.
+
+
+<a name="LogisticRegression-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+The bad news is that SGD is an inherently sequential algorithm.  The good
+news is that it is blazingly fast and thus it is not a problem for Mahout's
+implementation to handle training sets of tens of millions of examples. 
+With the down-sampling typical in many data-sets, this is equivalent to a
+dataset with billions of raw training examples.
+
+The SGD system in Mahout is an online learning algorithm which means that
+you can learn models in an incremental fashion and that you can do
+performance testing as your system runs.  Often this means that you can
+stop training when a model reaches a target level of performance.  The SGD
+framework includes classes to do on-line evaluation using cross validation
+(the CrossFoldLearner) and an evolutionary system to do learning
+hyper-parameter optimization on the fly (the AdaptiveLogisticRegression). 
+The AdaptiveLogisticRegression system makes heavy use of threads to
+increase machine utilization.  The way it works is that it runs 20
+CrossFoldLearners in separate threads, each with slightly different
+learning parameters.  As better settings are found, these new settings are
+propagated to the other learners.
+
+<a name="LogisticRegression-Designofpackages"></a>
+## Design of packages
+
+There are three packages that are used in Mahout's SGD system. These
+include
+
+* The vector encoding package (found in org.apache.mahout.vectorizer.encoders)
+
+* The SGD learning package (found in org.apache.mahout.classifier.sgd)
+
+* The evolutionary optimization system (found in org.apache.mahout.ep)
+
+<a name="LogisticRegression-Featurevectorencoding"></a>
+### Feature vector encoding
+
+Because the SGD algorithms need to have fixed length feature vectors and
+because it is a pain to build a dictionary ahead of time, most SGD
+applications use the hashed feature vector encoding system that is rooted
+at FeatureVectorEncoder.
+
+The basic idea is that you create a vector, typically a
+RandomAccessSparseVector, and then you use various feature encoders to
+progressively add features to that vector.  The size of the vector should
+be large enough to avoid feature collisions as features are hashed.
+
+There are specialized encoders for a variety of data types.  You can
+normally encode either a string representation of the value you want to
+encode or you can encode a byte level representation to avoid string
+conversion.  In the case of ContinuousValueEncoder and
+ConstantValueEncoder, it is also possible to encode a null value and pass
+the real value in as a weight. This avoids numerical parsing entirely in
+case you are getting your training data from a system like Avro.
+
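+A minimal sketch of this flow (the feature names and vector size here are
+illustrative):
+
+    import org.apache.mahout.math.RandomAccessSparseVector;
+    import org.apache.mahout.math.Vector;
+    import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
+    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;
+
+    // make the vector large enough to keep hash collisions rare
+    Vector features = new RandomAccessSparseVector(1000);
+
+    StaticWordValueEncoder colorEncoder = new StaticWordValueEncoder("color");
+    ContinuousValueEncoder ageEncoder = new ContinuousValueEncoder("age");
+
+    colorEncoder.addToVector("blue", features);             // hashed categorical feature
+    ageEncoder.addToVector((String) null, 42.0, features);  // null form, value passed as weight
+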
+Here is a class diagram for the encoders package:
+
+![class diagram](../../images/vector-class-hierarchy.png)
+
+<a name="LogisticRegression-SGDLearning"></a>
+### SGD Learning
+
+For the simplest applications, you can construct an
+OnlineLogisticRegression and be off and running.  Typically, though, it is
+nice to have running estimates of performance on held out data.  To do
+that, you should use a CrossFoldLearner which keeps a stable of five (by
+default) OnlineLogisticRegression objects.  Each time you pass a training
+example to a CrossFoldLearner, it passes this example to all but one of its
+children as training and passes the example to the last child to evaluate
+current performance.  The children are used for evaluation in a round-robin
+fashion so, if you are using the default 5 way split, all of the children
+get 80% of the training data for training and get 20% of the data for
+evaluation.
+
+To avoid the pesky need to configure learning rates, regularization
+parameters and annealing schedules, you can use the
+AdaptiveLogisticRegression.  This class maintains a pool of
+CrossFoldLearners and adapts learning rates and regularization on the fly
+so that you don't have to.
+
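+As a minimal sketch of the simplest case (the parameter values are
+illustrative, not recommendations):
+
+    import org.apache.mahout.classifier.sgd.L1;
+    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
+    import org.apache.mahout.math.DenseVector;
+    import org.apache.mahout.math.Vector;
+
+    // binary model over 2 features with an L1 prior
+    OnlineLogisticRegression learner = new OnlineLogisticRegression(2, 2, new L1())
+        .learningRate(0.1)
+        .lambda(1.0e-4);
+
+    Vector yes = new DenseVector(new double[] {1.0, 0.0});
+    Vector no = new DenseVector(new double[] {0.0, 1.0});
+    for (int pass = 0; pass < 100; pass++) {  // SGD trains one example at a time
+      learner.train(1, yes);
+      learner.train(0, no);
+    }
+    System.out.println(learner.classifyScalar(yes));  // P(category 1), near 1.0
+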
+Here is a class diagram for the classifiers.sgd package.  As you can see,
+the number of twiddlable knobs is pretty large.  For some examples, see the
+TrainNewsGroups example code.
+
+![sgd class diagram](../../images/sgd-class-hierarchy.png)
+

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/naivebayes.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/naivebayes.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/naivebayes.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/naivebayes.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,39 @@
+Title: NaiveBayes
+<a name="NaiveBayes-NaiveBayes"></a>
+# Naive Bayes
+
+Naive Bayes is an algorithm that can be used to classify objects into
+(usually binary) categories. It is one of the most common learning algorithms
+in spam filters. Despite its simplicity and rather naive assumptions it has
+proven to work surprisingly well in practice.
+
+Before applying the algorithm, the objects to be classified need to be
+represented by numerical features. In the case of e-mail spam each feature
+might indicate whether some specific word is present or absent in the mail
+to classify. The algorithm comes in two phases: Learning and application.
+During learning, a set of feature vectors is given to the algorithm, each
+vector labeled with the class of the object it represents. From
+that it is deduced which combination of features appears with high
+probability in spam messages. Given this information, during application
+one can easily compute the probability of a new message being either spam
+or not.
+
+The algorithm makes several assumptions that are not true of most
+datasets, but that make computations easier. Probably the worst is that all
+features of an object are considered independent. In practice, this means
+that having already found the phrase "Statue of Liberty" in a text does not
+influence the probability of seeing the phrase "New York" as well.
+
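+In symbols: under this independence assumption the classifier simply labels
+an object with features f_1,...,f_k with the class c maximizing
+P(c)P(f_1|c)...P(f_k|c), where the individual probabilities are estimated
+from the labeled training data.
+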
+<a name="NaiveBayes-StrategyforaparallelNaiveBayes"></a>
+## Strategy for a parallel Naive Bayes
+
+See 
[https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9)
+.
+
+
+<a name="NaiveBayes-Examples"></a>
+## Examples
+
+[20Newsgroups](20newsgroups.html) - Example code showing how to train and
+use the Naive Bayes classifier using the 20 Newsgroups data available at
+[http://people.csail.mit.edu/jrennie/20Newsgroups/](http://people.csail.mit.edu/jrennie/20Newsgroups/)

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/neural-network.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/neural-network.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/neural-network.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/neural-network.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,16 @@
+Title: Neural Network
+<a name="NeuralNetwork-NeuralNetworks"></a>
+# Neural Networks
+
+Neural Networks are a means for classifying multi-dimensional objects. We
+concentrate on implementing backpropagation networks with one hidden layer,
+as these networks have been covered by the [2006 NIPS map reduce paper](http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf).
+Those networks are capable of learning not only linear separating
+hyperplanes but arbitrary decision boundaries.
+
+<a name="NeuralNetwork-Strategyforparallelbackpropagationnetwork"></a>
+## Strategy for parallel backpropagation network
+
+
+<a name="NeuralNetwork-Designofimplementation"></a>
+## Design of implementation

Added: mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/partial-implementation.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/partial-implementation.mdtext?rev=1667878&view=auto
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/partial-implementation.mdtext (added)
+++ mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/partial-implementation.mdtext Thu Mar 19 21:21:28 2015
@@ -0,0 +1,140 @@
+Title: Partial Implementation
+
+# Classifying with random forests
+
+<a name="PartialImplementation-Introduction"></a>
+# Introduction
+
+This quick start page shows how to build a decision forest using the
+partial implementation. This tutorial also explains how to use the decision
+forest to classify new data.
+Partial Decision Forests is a mapreduce implementation where each mapper
+builds a subset of the forest using only the data available in its
+partition. This allows building forests using large datasets as long as
+each partition can be loaded in-memory.
+
+<a name="PartialImplementation-Steps"></a>
+# Steps
+<a name="PartialImplementation-Downloadthedata"></a>
+## Download the data
+* The current implementation is compatible with the UCI repository file
+format. In this example we'll use the NSL-KDD dataset because it's large
+enough to show the performance of the partial implementation.
+You can download the dataset here: http://nsl.cs.unb.ca/NSL-KDD/
+You can either download the full training set "KDDTrain+.ARFF", or a 20%
+subset "KDDTrain+_20Percent.ARFF" (we'll use the full dataset in this
+tutorial) and the test set "KDDTest+.ARFF".
+* Open the train and test files and remove all the lines that begin with
+'@'. All those lines are at the top of the files. Actually you can keep
+those lines somewhere, because they'll help us describe the dataset to
+Mahout.
+* Put the data in HDFS:
+
+        $HADOOP_HOME/bin/hadoop fs -mkdir testdata
+        $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+
+<a name="PartialImplementation-BuildtheJobfiles"></a>
+## Build the Job files
+* In $MAHOUT_HOME/ run: `mvn clean install -DskipTests`
+
+<a name="PartialImplementation-Generateafiledescriptorforthedataset:"></a>
+## Generate a file descriptor for the dataset: 
+run the following command:
+
+    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
+
+The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string describes all the attributes
+of the data. In this case, it means 1 numerical (N) attribute, followed by
+3 categorical (C) attributes, ... L indicates the label. You can also use 'I'
+to ignore some attributes.
+
+<a name="PartialImplementation-Runtheexample"></a>
+## Run the example
+
+
+    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
+
+which builds 100 trees (-t argument) using the partial implementation (-p).
+Each tree is built using 5 randomly selected attributes per node (-sl
+argument) and the example outputs the decision forest in the "nsl-forest"
+directory (-o).
+The number of partitions is controlled by the -Dmapred.max.split.size
+argument, which indicates to Hadoop the maximum size of each partition, in
+this case 1/10 of the size of the dataset. Thus 10 partitions will be used.
+IMPORTANT: using fewer partitions should give better classification results,
+but needs a lot of memory. So if the Jobs are failing, try increasing the
+number of partitions.
+* The example outputs the Build Time and the oob error estimation
+
+
+    10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582
+    10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate : 0.002325895231517865
+    10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in: nsl-forest/forest.seq
+
+
+<a name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></a>
+## Using the Decision Forest to Classify new data
+run the following command:
+
+    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info -m nsl-forest -a -mr -o predictions
+
+This will compute the predictions of the "KDDTest+.arff" dataset (-i argument)
+using the same data descriptor generated for the training dataset (-ds) and
+the decision forest built previously (-m). Optionally (if the test dataset
+contains the labels of the tuples) run the analyzer to compute the
+confusion matrix (-a), and you can also store the predictions in a text
+file or a directory of text files (-o). Passing the (-mr) parameter will use
+Hadoop to distribute the classification.
+
+* The example should output the classification time and the confusion
+matrix
+
+
+    10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s 355
+    10/03/13 18:08:56 INFO mapreduce.TestForest: =======================================================
+    Summary
+    -------------------------------------------------------
+    Correctly Classified Instances             :      17657       78.3224%
+    Incorrectly Classified Instances   :       4887       21.6776%
+    Total Classified Instances         :      22544
+    
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+    a  b       <--Classified as
+    9459       252      |  9711        a     = normal
+    4635       8198     |  12833       b     = anomaly
+    Default Category: unknown: 2
+
+
+If the input is a single file then the output will be a single text file;
+in the above example 'predictions' would be one single file. If the input
+is a directory containing for example two files 'a.data' and 'b.data', then
+the output will be a directory 'predictions' containing two files
+'a.data.out' and 'b.data.out'.
+
+<a name="PartialImplementation-KnownIssuesandlimitations"></a>
+## Known Issues and limitations
+The "Decision Forest" code is still a work in progress; many features are
+still missing. Here is a list of some known issues:
+* For now, the training does not support multiple input files. The input
+dataset must be one single file (this support will be available with the
+upcoming release). Classifying new data does support multiple input files.
+* The tree building is done when each mapper.close() method is called.
+Because the mappers don't refresh their state, the job can fail when the
+dataset is big and you try to build a large number of trees.

Added: 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/random-forests.mdtext
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/random-forests.mdtext?rev=1667878&view=auto
==============================================================================
--- 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/random-forests.mdtext
 (added)
+++ 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/random-forests.mdtext
 Thu Mar 19 21:21:28 2015
@@ -0,0 +1,228 @@
+Title: Random Forests
+<a name="RandomForests-HowtogrowaDecisionTree"></a>
+### How to grow a Decision Tree
+
+Source: \[3\]
+
+    LearnUnprunedTree(X, Y)
+
+    Input:  X, a matrix of R rows and M columns, where X_ij is the value
+            of the j'th attribute in the i'th input datapoint. Each column
+            consists of either all real values or all categorical values.
+    Input:  Y, a vector of R elements, where Y_i is the output class of
+            the i'th datapoint. The Y_i values are categorical.
+    Output: an unpruned decision tree
+
+    If all records in X have identical values in all their attributes
+    (this includes the case where R < 2), return a Leaf Node predicting
+    the majority output, breaking ties randomly.
+    If all values in Y are the same, return a Leaf Node predicting this
+    value as the output.
+    Else
+        select m variables at random out of the M variables
+        For j = 1 .. m
+            If the j'th attribute is categorical
+                IG_j = IG(Y|X_j)    (see Information Gain)
+            Else (the j'th attribute is real-valued)
+                IG_j = IG*(Y|X_j)   (see Information Gain)
+        Let j* = argmax_j IG_j   (this is the splitting attribute we'll use)
+        If j* is categorical then
+            For each value v of the j*'th attribute
+                Let X^v = the subset of rows of X in which X_ij* = v
+                Let Y^v = the corresponding subset of Y
+                Let Child^v = LearnUnprunedTree(X^v, Y^v)
+            Return a decision tree node, splitting on the j*'th attribute.
+            The number of children equals the number of values of the
+            j*'th attribute, and the v'th child is Child^v.
+        Else (j* is real-valued); let t be the best split threshold
+            Let X^LO = the subset of rows of X in which X_ij* <= t
+            Let Y^LO = the corresponding subset of Y
+            Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
+            Let X^HI = the subset of rows of X in which X_ij* > t
+            Let Y^HI = the corresponding subset of Y
+            Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
+            Return a decision tree node, splitting on the j*'th attribute.
+            It has two children corresponding to whether the j*'th
+            attribute is above or below the split threshold t.
+
+*Note*: there are alternatives to Information Gain for splitting nodes.
+
+<a name="RandomForests-Informationgain"></a>
+### Information gain
+
+Source: \[3\]
+
+#### Nominal attributes
+
+Suppose X can take one of m values V_1, V_2, ..., V_m, with
+P(X = V_1) = p_1, P(X = V_2) = p_2, ..., P(X = V_m) = p_m. Then:
+
+    H(X) = -sum_{j=1..m} p_j log2(p_j)    (the entropy of X)
+    H(Y|X=v) = the entropy of Y among only those records in which X has
+               value v
+    H(Y|X) = sum_j p_j H(Y|X=v_j)
+    IG(Y|X) = H(Y) - H(Y|X)
+
+#### Real-valued attributes
+
+Suppose X is real-valued. Define:
+
+    IG(Y|X:t) = H(Y) - H(Y|X:t)
+    H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
+    IG*(Y|X) = max_t IG(Y|X:t)
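+
+To make these definitions concrete, here is a small, self-contained
+sketch (in Python; it is not Mahout code) that computes H(Y), H(Y|X) and
+IG(Y|X) for a nominal attribute:
+
+    from collections import Counter
+    from math import log2
+
+    def entropy(labels):
+        # H(Y) = -sum_j p_j log2(p_j) over the empirical label distribution
+        n = len(labels)
+        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())
+
+    def information_gain(xs, ys):
+        # IG(Y|X) = H(Y) - sum_v P(X=v) H(Y|X=v), for a nominal attribute X
+        n = len(ys)
+        h_cond = 0.0
+        for v in set(xs):
+            subset = [y for x, y in zip(xs, ys) if x == v]
+            h_cond += (len(subset) / n) * entropy(subset)
+        return entropy(ys) - h_cond
+
+    # Toy data: one nominal attribute and the class labels.
+    xs = ["sunny", "sunny", "rain", "rain", "overcast"]
+    ys = ["no", "no", "yes", "yes", "yes"]
+    print(information_gain(xs, ys))  # ~0.971: here X fully determines Y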
+
+<a name="RandomForests-HowtogrowaRandomForest"></a>
+### How to grow a Random Forest
+
+Source: \[1\]
+
+Each tree is grown as follows (a minimal sketch of steps 1 and 2 follows
+the list):
+1. If the number of cases in the training set is *N*, sample *N* cases at
+random, but *with replacement*, from the original data. This sample will
+be the training set for growing the tree.
+1. If there are *M* input variables, a number *m << M* is specified such
+that at each node, *m* variables are selected at random out of the *M*,
+and the best split on these *m* is used to split the node. The value of
+*m* is held constant while the forest is grown.
+1. Each tree is grown to the largest extent possible. There is no
+pruning.
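+
+As an illustration of steps 1 and 2 (a sketch under the definitions
+above, not the Mahout implementation), drawing the bootstrap sample and
+the per-node variable subset might look like:
+
+    import random
+    from math import sqrt
+
+    def bootstrap_sample(data):
+        # Step 1: N cases drawn at random *with replacement* from N cases.
+        return [random.choice(data) for _ in range(len(data))]
+
+    def random_variable_subset(num_variables):
+        # Step 2: m variables chosen at random out of M at each node;
+        # m = sqrt(M) is Breiman's usual recommendation (see below).
+        m = max(1, int(sqrt(num_variables)))
+        return random.sample(range(num_variables), m)
+
+    training_set = bootstrap_sample(list(range(100)))  # toy dataset of row ids
+    print(random_variable_subset(16))                  # e.g. 4 of 16 attributes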
+
+<a name="RandomForests-RandomForestparameters"></a>
+### Random Forest parameters
+
+Source: \[2\]
+
+Random Forests are easy to use; the only two parameters a user of the
+technique has to determine are the number of trees to be used and the
+number of variables (*m*) to be randomly selected from the available set
+of variables.
+Breiman's recommendations are to pick a large number of trees, and to use
+the square root of the number of variables for *m*.
+
+<a name="RandomForests-Howtopredictthelabelofacase"></a>
+### How to predict the label of a case
+
+    Classify(node, V)
+
+    Input:  node, a node of the decision tree; if node.attribute = j then
+            the split is done on the j'th attribute
+    Input:  V, a vector of M columns, where V_j is the value of the j'th
+            attribute
+    Output: the label of V
+
+    If node is a Leaf then
+        Return the value predicted by node
+    Else
+        Let j = node.attribute
+        If the j'th attribute is categorical then
+            Let v = V_j
+            Let Child^v = the child node corresponding to the attribute
+                          value v
+            Return Classify(Child^v, V)
+        Else (the j'th attribute is real-valued)
+            Let t = node.threshold   (the split threshold)
+            If V_j < t then
+                Let Child^LO = the child node corresponding to (< t)
+                Return Classify(Child^LO, V)
+            Else
+                Let Child^HI = the child node corresponding to (>= t)
+                Return Classify(Child^HI, V)
+
+<a name="RandomForests-Theoutofbag(oob)errorestimation"></a>
+### The out of bag (oob) error estimation
+
+Source: \[1\]
+
+In random forests, there is no need for cross-validation or a separate
+test set to get an unbiased estimate of the test set error. It is
+estimated internally, during the run, as follows:
+* Each tree is constructed using a different bootstrap sample from the
+original data. About one-third of the cases are left out of the bootstrap
+sample and not used in the construction of the _kth_ tree.
+* Put each case left out of the construction of the _kth_ tree down the
+_kth_ tree to get a classification. In this way, a test set
+classification is obtained for each case in about one-third of the trees.
+At the end of the run, take *j* to be the class that got most of the
+votes every time case *n* was _oob_. The proportion of times that *j* is
+not equal to the true class of *n*, averaged over all cases, is the _oob
+error estimate_. This has proven to be unbiased in many tests.
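+
+A minimal sketch of the bookkeeping described above (illustrative only;
+Mahout's BuildForest computes this internally, and the tree objects with
+a predict() method are an assumed interface of this sketch):
+
+    from collections import Counter, defaultdict
+
+    def oob_error(trees, oob_indices, data, labels):
+        # trees[k] is a classifier with .predict(x) (assumed interface);
+        # oob_indices[k] holds the row indices left out of tree k's
+        # bootstrap sample.
+        votes = defaultdict(Counter)  # case index -> votes per class
+        for tree, left_out in zip(trees, oob_indices):
+            for i in left_out:
+                votes[i][tree.predict(data[i])] += 1
+        wrong = sum(1 for i, v in votes.items()
+                    if v.most_common(1)[0][0] != labels[i])
+        return wrong / len(votes)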
+
+<a name="RandomForests-OtherRFuses"></a>
+### Other RF uses
+
+Source: \[1\]
+* variable importance
+* gini importance
+* proximities
+* scaling
+* prototypes
+* missing values replacement for the training set
+* missing values replacement for the test set
+* detecting mislabeled cases
+* detecting outliers
+* detecting novelties
+* unsupervised learning
+* balancing prediction error
+Please refer to \[1\] for a detailed description.
+
+<a name="RandomForests-References"></a>
+### References
+
+\[1\] Random Forests - Classification Description.
+[http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)
+
+\[2\] B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention
+and Profitability by Using Random Forests and Regression Forests
+Techniques," Working Papers of Faculty of Economics and Business
+Administration, Ghent University, Belgium 04/282.
+[http://ideas.repec.org/p/rug/rugwps/04-282.html](http://ideas.repec.org/p/rug/rugwps/04-282.html)
+
+\[3\] Decision Trees - Andrew W. Moore.
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)
+
+\[4\] Information Gain - Andrew W. Moore.
+[http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)

Added: 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/restricted-boltzmann-machines.mdtext
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/restricted-boltzmann-machines.mdtext?rev=1667878&view=auto
==============================================================================
--- 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/restricted-boltzmann-machines.mdtext
 (added)
+++ 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/restricted-boltzmann-machines.mdtext
 Thu Mar 19 21:21:28 2015
@@ -0,0 +1,43 @@
+Title: Restricted Boltzmann Machines
+NOTE: This implementation is a work in progress, at least until September
+2010.
+
+The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT-375).
+
+<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a>
+### Boltzmann Machines
+Boltzmann Machines are a type of stochastic neural network that closely
+resembles physical processes. They define a network of units with an
+overall energy that evolves over time until it reaches thermal
+equilibrium.
+
+However, the convergence speed of Boltzmann machines with unconstrained
+connectivity is low.
+
+<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a>
+### Restricted Boltzmann Machines
+Restricted Boltzmann Machines are a variant that is 'restricted' in the
+sense that connections between hidden units of a single layer are _not_
+allowed. In addition, stacking multiple RBMs is also feasible, with the
+activities of the hidden units forming the base for a higher-level RBM.
+The combination of these two features makes RBMs highly amenable to
+parallelization.
+
+In the Netflix Prize, RBMs offered predictions distinctly orthogonal to
+SVD and k-NN approaches, and contributed immensely to the final solution.
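+
+To illustrate why the restriction helps, here is a minimal NumPy sketch
+(not the Mahout code) of one block-Gibbs sampling step in a binary RBM:
+
+    import numpy as np
+
+    rng = np.random.default_rng(0)
+
+    def sigmoid(x):
+        return 1.0 / (1.0 + np.exp(-x))
+
+    def gibbs_step(v, W, b_vis, b_hid):
+        # Because hidden units are not connected to each other, all of h
+        # can be sampled in parallel given v (and vice versa) -- this is
+        # the 'restriction' that makes RBMs easy to parallelize.
+        p_h = sigmoid(v @ W + b_hid)
+        h = (rng.random(p_h.shape) < p_h).astype(float)
+        p_v = sigmoid(h @ W.T + b_vis)
+        return (rng.random(p_v.shape) < p_v).astype(float), h
+
+    # Toy sizes: 6 visible and 3 hidden units.
+    W = rng.normal(scale=0.1, size=(6, 3))
+    v0 = rng.integers(0, 2, size=6).astype(float)
+    v1, h = gibbs_step(v0, W, np.zeros(6), np.zeros(3))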
+
+<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a>
+### RBM's in Apache Mahout
+An implementation of Restricted Boltzmann Machines is being developed for
+Apache Mahout as a Google Summer of Code 2010 project. A recommender
+interface will also be provided. The key aims of the implementation are:
+1. Accurate - it should replicate known results, including those of the
+Netflix Prize
+1. Fast - the implementation uses Map-Reduce, hence it should be fast
+1. Scalable - it should scale to large datasets, with a design whose
+critical parts do not tie the amount of memory on your cluster nodes to
+the size of your dataset
+
+You can view the patch as it develops
+[here](http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm).

Added: 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/support-vector-machines.mdtext
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/support-vector-machines.mdtext?rev=1667878&view=auto
==============================================================================
--- 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/support-vector-machines.mdtext
 (added)
+++ 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/support-vector-machines.mdtext
 Thu Mar 19 21:21:28 2015
@@ -0,0 +1,37 @@
+Title: Support Vector Machines
+<a name="SupportVectorMachines-SupportVectorMachines"></a>
+# Support Vector Machines
+
+As with Naive Bayes, Support Vector Machines (or SVMs for short) can be
+used to solve the task of assigning objects to classes. However, the way
+this task is solved is completely different from the setting in Naive
+Bayes.
+
+Each object is considered to be a point in _n_-dimensional feature space,
+_n_ being the number of features used to describe the objects
+numerically. In addition, each object is assigned a binary label; let us
+assume the labels are "positive" and "negative". During learning, the
+algorithm tries to find a hyperplane in that space that perfectly
+separates positive from negative objects.
+It is easy to think of settings where this may well be impossible. To
+remedy this situation, objects can be assigned so-called slack terms that
+penalize mistakes made during learning appropriately. That way, the
+algorithm is forced to find the hyperplane that causes the fewest
+mistakes.
+
+Another way to overcome the problem of there being no linear hyperplane
+that separates positive from negative objects is to project each feature
+vector into a higher-dimensional feature space and search for a linear
+separating hyperplane there. Usually the main problem with learning in
+high-dimensional feature spaces is the so-called curse of dimensionality:
+there are fewer learning examples available than free parameters to tune.
+For SVMs this problem is less detrimental, as SVMs impose additional
+structural constraints on their solutions: each separating hyperplane
+must have a maximal margin to all training examples. As a consequence,
+the solution may be based on the information encoded in only very few
+examples.
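+
+The hyperplane-with-slack idea can be written down compactly. Below is a
+minimal sketch (not Mahout's implementation) of the soft-margin objective
+of a linear SVM; the data and weights are toy values:
+
+    import numpy as np
+
+    def soft_margin_objective(w, b, X, y, C=1.0):
+        # 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b)).
+        # The hinge terms are the slack that penalizes points on the
+        # wrong side of (or inside) the margin; minimizing ||w||
+        # maximizes the margin.
+        margins = y * (X @ w + b)            # y must be in {-1, +1}
+        slack = np.maximum(0.0, 1.0 - margins)
+        return 0.5 * np.dot(w, w) + C * slack.sum()
+
+    X = np.array([[1.0, 1.0], [-1.0, -1.0]])
+    y = np.array([1.0, -1.0])
+    print(soft_margin_objective(np.array([0.5, 0.5]), 0.0, X, y))  # 0.25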
+
+<a name="SupportVectorMachines-Strategyforparallelization"></a>
+## Strategy for parallelization
+
+<a name="SupportVectorMachines-Designofpackages"></a>
+## Design of packages

Added: 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/twenty-newsgroups.mdtext
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/twenty-newsgroups.mdtext?rev=1667878&view=auto
==============================================================================
--- 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/twenty-newsgroups.mdtext
 (added)
+++ 
mahout/site/mahout_cms/trunk/content/users/mapreduce/classification/twenty-newsgroups.mdtext
 Thu Mar 19 21:21:28 2015
@@ -0,0 +1,173 @@
+Title: Twenty Newsgroups
+
+<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a>
+## Twenty Newsgroups Classification Example
+
+<a name="TwentyNewsgroups-Introduction"></a>
+## Introduction
+
+The 20 newsgroups dataset is a collection of approximately 20,000
+newsgroup documents, partitioned (nearly) evenly across 20 different
+newsgroups. The 20 newsgroups collection has become a popular data set for
+experiments in text applications of machine learning techniques, such as
+text classification and text clustering. We will use the
+[Mahout CBayes](http://mahout.apache.org/users/mapreduce/classification/bayesian.html)
+classifier to create a model that classifies a new document into one of
+the 20 newsgroups.
+
+<a name="TwentyNewsgroups-Prerequisites"></a>
+### Prerequisites
+
+* Mahout has been downloaded ([instructions 
here](https://mahout.apache.org/general/downloads.html))
+* Maven is available
+* Your environment has the following variables set:
+     * **HADOOP_HOME** points to where Hadoop lives
+     * **MAHOUT_HOME** points to where Mahout lives
+
+<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a>
+### Instructions for running the example
+
+1. If running Hadoop in cluster mode, start the Hadoop daemons by
+executing the following commands:
+
+            $ cd $HADOOP_HOME/bin
+            $ ./start-all.sh
+   
+    Otherwise:
+
+            $ export MAHOUT_LOCAL=true
+
+2. In the trunk directory of Mahout, compile and install Mahout:
+
+            $ cd $MAHOUT_HOME
+            $ mvn -DskipTests clean install
+
+3. Run the [20 newsgroups example 
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
 by executing:
+
+            $ ./examples/bin/classify-20newsgroups.sh
+
+4. You will be prompted to select a classification method algorithm: 
+    
+            1. Complement Naive Bayes
+            2. Naive Bayes
+            3. Stochastic Gradient Descent
+
+Select 1 and the script will perform the following:
+
+1. Create a working directory for the dataset and all input/output.
+2. Download and extract the *20news-bydate.tar.gz* from the [20 newsgroups 
dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) 
to the working directory.
+3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. 
+4. Convert and preprocess the dataset into a < Text, VectorWritable > 
SequenceFile containing term frequencies for each document.
+5. Split the preprocessed dataset into training and testing sets. 
+6. Train the classifier.
+7. Test the classifier.
+
+
+Output should look something like:
+
+
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+     a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t <--Classified as
+    381 0  0  0  0  9  1  0  0  0  1  0  0  2  0  1  0  0  3  0 |398 a=rec.motorcycles
+     1 284 0  0  0  0  1  0  6  3  11 0  66 3  0  6  0  4  9  0 |395 b=comp.windows.x
+     2  0 339 2  0  3  5  1  0  0  0  0  1  1  12 1  7  0  2  0 |376 c=talk.politics.mideast
+     4  0  1 327 0  2  2  0  0  2  1  1  0  5  1  4  12 0  2  0 |364 d=talk.politics.guns
+     7  0  4  32 27 7  7  2  0  12 0  0  6  0 100 9  7  31 0  0 |251 e=talk.religion.misc
+     10 0  0  0  0 359 2  2  0  0  3  0  1  6  0  1  0  0  11 0 |396 f=rec.autos
+     0  0  0  0  0  1 383 9  1  0  0  0  0  0  0  0  0  3  0  0 |397 g=rec.sport.baseball
+     1  0  0  0  0  0  9 382 0  0  0  0  1  1  1  0  2  0  2  0 |399 h=rec.sport.hockey
+     2  0  0  0  0  4  3  0 330 4  4  0  5  12 0  0  2  0  12 7 |385 i=comp.sys.mac.hardware
+     0  3  0  0  0  0  1  0  0 368 0  0  10 4  1  3  2  0  2  0 |394 j=sci.space
+     0  0  0  0  0  3  1  0  27 2 291 0  11 25 0  0  1  0  13 18|392 k=comp.sys.ibm.pc.hardware
+     8  0  1 109 0  6  11 4  1  18 0  98 1  3  11 10 27 1  1  0 |310 l=talk.politics.misc
+     0  11 0  0  0  3  6  0  10 6  11 0 299 13 0  2  13 0  7  8 |389 m=comp.graphics
+     6  0  1  0  0  4  2  0  5  2  12 0  8 321 0  4  14 0  8  6 |393 n=sci.electronics
+     2  0  0  0  0  0  4  1  0  3  1  0  3  1 372 6  0  2  1  2 |398 o=soc.religion.christian
+     4  0  0  1  0  2  3  3  0  4  2  0  7  12 6 342 1  0  9  0 |396 p=sci.med
+     0  1  0  1  0  1  4  0  3  0  1  0  8  4  0  2 369 0  1  1 |396 q=sci.crypt
+     10 0  4  10 1  5  6  2  2  6  2  0  2  1 86 15 14 152 0  1 |319 r=alt.atheism
+     4  0  0  0  0  9  1  1  8  1  12 0  3  0  2  0  0  0 341 2 |390 s=misc.forsale
+     8  5  0  0  0  1  6  0  8  5  50 0  40 2  1  0  9  0  3 256|394 t=comp.os.ms-windows.misc
+    =======================================================
+    Statistics
+    -------------------------------------------------------
+    Kappa                                       0.8808
+    Accuracy                                   90.8596%
+    Reliability                                86.3632%
+    Reliability (standard deviation)            0.2131
+
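+The statistics above can be derived from the confusion matrix alone.
+Here is a small illustrative sketch of how Accuracy and (Cohen's) Kappa
+can be computed from a confusion matrix; it is not necessarily the exact
+code Mahout uses:
+
+    import numpy as np
+
+    def accuracy_and_kappa(cm):
+        # cm[i][j] = number of instances of true class i classified as j
+        cm = np.asarray(cm, dtype=float)
+        total = cm.sum()
+        observed = np.trace(cm) / total                   # accuracy
+        expected = (cm.sum(0) * cm.sum(1)).sum() / total**2
+        return observed, (observed - expected) / (1.0 - expected)
+
+    # Toy 2x2 matrix, not the 20-newsgroups matrix above.
+    print(accuracy_and_kappa([[40, 10], [5, 45]]))        # (0.85, 0.70)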
+
+
+
+
+<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a>
+## End to end commands to build a CBayes model for 20 newsgroups
+The [20 newsgroups example 
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
 issues the following commands as outlined above. We can build a CBayes 
classifier from the command line by following the process in the script: 
+
+*Be sure that **MAHOUT_HOME**/bin and **HADOOP_HOME**/bin are in your 
**$PATH***
+
+1. Create a working directory for the dataset and all input/output.
+           
+            $ export WORK_DIR=/tmp/mahout-work-${USER}
+            $ mkdir -p ${WORK_DIR}
+
+2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups 
dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) 
to the working directory.
+
+            $ curl 
http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz 
+                -o ${WORK_DIR}/20news-bydate.tar.gz
+            $ mkdir -p ${WORK_DIR}/20news-bydate
+            $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz 
&& cd .. && cd ..
+            $ mkdir ${WORK_DIR}/20news-all
+            $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
+     * If you're running on a Hadoop cluster:
+ 
+            $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
+
+3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. 
+          
+            $ mahout seqdirectory 
+                -i ${WORK_DIR}/20news-all 
+                -o ${WORK_DIR}/20news-seq 
+                -ow
+            
+4. Convert and preprocess the dataset into a < Text, VectorWritable > 
SequenceFile containing term frequencies for each document. 
+            
+            $ mahout seq2sparse 
+                -i ${WORK_DIR}/20news-seq 
+                -o ${WORK_DIR}/20news-vectors
+                -lnorm 
+                -nv 
+                -wt tfidf
+If we wanted to use different parsing methods or transformations on the
+term frequency vectors, we could supply different options here, e.g. -ng 2
+for bigrams or -n 2 for L2 length normalization. See the
+[Creating vectors from text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html)
+page for a list of all seq2sparse options.
+
+5. Split the preprocessed dataset into training and testing sets.
+
+            $ mahout split 
+                -i ${WORK_DIR}/20news-vectors/tfidf-vectors 
+                --trainingOutput ${WORK_DIR}/20news-train-vectors 
+                --testOutput ${WORK_DIR}/20news-test-vectors  
+                --randomSelectionPct 40 
+                --overwrite --sequenceFiles -xm sequential
+ 
+6. Train the classifier.
+
+            $ mahout trainnb 
+                -i ${WORK_DIR}/20news-train-vectors
+                -el  
+                -o ${WORK_DIR}/model 
+                -li ${WORK_DIR}/labelindex 
+                -ow 
+                -c
+
+7. Test the classifier.
+
+            $ mahout testnb 
+                -i ${WORK_DIR}/20news-test-vectors
+                -m ${WORK_DIR}/model 
+                -l ${WORK_DIR}/labelindex 
+                -ow 
+                -o ${WORK_DIR}/20news-testing 
+                -c
+
+ 
+       
\ No newline at end of file

Added: 
mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/20newsgroups.mdtext
URL: 
http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/20newsgroups.mdtext?rev=1667878&view=auto
==============================================================================
--- 
mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/20newsgroups.mdtext
 (added)
+++ 
mahout/site/mahout_cms/trunk/content/users/mapreduce/clustering/20newsgroups.mdtext
 Thu Mar 19 21:21:28 2015
@@ -0,0 +1,5 @@
+Title: 20Newsgroups
+<a name="20Newsgroups-NaiveBayesusing20NewsgroupsData"></a>
+# Naive Bayes using 20 Newsgroups Data
+
+See 
[https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9)

