http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/classification/bankmarketing-example.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/classification/bankmarketing-example.md 
b/website-old/docs/tutorials/map-reduce/classification/bankmarketing-example.md
new file mode 100644
index 0000000..e348961
--- /dev/null
+++ 
b/website-old/docs/tutorials/map-reduce/classification/bankmarketing-example.md
@@ -0,0 +1,53 @@
+---
+layout: tutorial
+title: (Deprecated)  Bank Marketing Example
+theme:
+    name: retro-mahout
+---
+
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+#Bank Marketing Example
+
+### Introduction
+
+This page describes how to run Mahout's SGD classifier on the [UCI Bank Marketing dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing).
+The goal is to predict whether the client will subscribe to a term deposit offered via a phone call. The features in the dataset consist
+of information such as age, job and marital status, as well as information about the bank's last contacts with the client.
+
+### Code & Data
+
+The bank marketing example code lives under 
+
+*mahout-examples/src/main/java/org.apache.mahout.classifier.sgd.bankmarketing*
+
+The data can be found at 
+
+*mahout-examples/src/main/resources/bank-full.csv*
+
+### Code details
+
+This example consists of 3 classes:
+
+  - BankMarketingClassificationMain
+  - TelephoneCall
+  - TelephoneCallParser
+
+When you run the main method of BankMarketingClassificationMain, it parses the dataset using the TelephoneCallParser and trains
+a logistic regression model with 20 runs and 20 passes. The TelephoneCallParser uses Mahout's feature vector encoder
+to encode the features in the dataset into a vector. Afterwards the model is tested, and the learning rate, AUC and accuracy are printed to standard output.
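+
+One way to launch the example from the command line might be to mirror the `mvn exec:java` pattern used by the clustering display examples elsewhere in these tutorials; the main class name below is assembled from the package and class names listed above, so treat this as an illustrative sketch rather than a documented entry point:
+
+    # illustrative only: assumes mahout-examples has been built (mvn -DskipTests clean install)
+    $ cd $MAHOUT_HOME/examples
+    $ mvn -q exec:java -Dexec.mainClass=org.apache.mahout.classifier.sgd.bankmarketing.BankMarketingClassificationMain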
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/classification/breiman-example.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/classification/breiman-example.md 
b/website-old/docs/tutorials/map-reduce/classification/breiman-example.md
new file mode 100644
index 0000000..32f8c44
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/classification/breiman-example.md
@@ -0,0 +1,67 @@
+---
+layout: tutorial
+title: (Deprecated)  Breiman Example
+theme:
+    name: retro-mahout
+---
+
+#Breiman Example
+
+#### Introduction
+
+This page describes how to run the Breiman example, which implements the test 
procedure described in [Leo Breiman's 
paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3999&rep=rep1&type=pdf).
The basic algorithm is as follows:
+
+ * repeat *I* iterations
+ * in each iteration do
+  * keep 10% of the dataset apart as a testing set 
+  * build two forests using the training set, one with *m = int(log2(M) + 1)* 
(called Random-Input) and one with *m = 1* (called Single-Input)
+  * choose the forest that gave the lowest oob error estimation to compute
+the test set error
+  * compute the test set error using the Single Input Forest (test error),
+this demonstrates that even with *m = 1*, Decision Forests give comparable
+results to greater values of *m*
+  * compute the mean test set error using every tree of the chosen forest
+(tree error). This should indicate how well a single Decision Tree performs
+ * compute the mean test error for all iterations
+ * compute the mean tree error for all iterations
+
+
+#### Running the Example
+
+The current implementation is compatible with the [UCI 
repository](http://archive.ics.uci.edu/ml/) file format. We'll show how to run 
this example on two datasets:
+
+First, we deal with [Glass 
Identification](http://archive.ics.uci.edu/ml/datasets/Glass+Identification): 
download the 
[dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data)
 file called **glass.data** and store it onto your local machine. Next, we must 
generate the descriptor file **glass.info** for this dataset with the following 
command:
+
+    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p 
/path/to/glass.data -f /path/to/glass.info -d I 9 N L
+
+Substitute */path/to/* with the folder where you downloaded the dataset. The
+argument "I 9 N L" indicates the nature of the variables: here it means 1
+ignored (I) attribute, followed by 9 numerical (N) attributes, followed by
+the label (L).
+
+Finally, we build and evaluate our random forest classifier as follows:
+
+    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d 
/path/to/glass.data -ds /path/to/glass.info -i 10 -t 100
+
+which builds 100 trees (-t argument) and repeats the test for 10 iterations (-i argument).
+
+The example outputs the following results:
+
+ * Selection error: mean test error for the selected forest on all iterations
+ * Single Input error: mean test error for the single input forest on all
+iterations
+ * One Tree error: mean single tree error on all iterations
+ * Mean Random Input Time: mean build time for random input forests on all
+iterations
+ * Mean Single Input Time: mean build time for single input forests on all
+iterations
+
+We can repeat this for a 
[Sonar](http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29)
use case: download the 
[dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data)
 file called **sonar.all-data** and store it onto your local machine. Generate 
the descriptor file **sonar.info** for this dataset with the following command:
+
+    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p 
/path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L
+
+The argument "60 N L" means 60 numerical(N) attributes, followed by the label 
(L). Analogous to the previous case, we run the evaluation as follows:
+
+    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d 
/path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/classification/twenty-newsgroups.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/classification/twenty-newsgroups.md 
b/website-old/docs/tutorials/map-reduce/classification/twenty-newsgroups.md
new file mode 100644
index 0000000..2226e94
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/classification/twenty-newsgroups.md
@@ -0,0 +1,179 @@
+---
+layout: tutorial
+title: (Deprecated)  Twenty Newsgroups
+theme:
+    name: retro-mahout
+---
+
+
+<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a>
+## Twenty Newsgroups Classification Example
+
+<a name="TwentyNewsgroups-Introduction"></a>
+## Introduction
+
+The 20 newsgroups dataset is a collection of approximately 20,000
+newsgroup documents, partitioned (nearly) evenly across 20 different
+newsgroups. The 20 newsgroups collection has become a popular data set for
+experiments in text applications of machine learning techniques, such as
+text classification and text clustering. We will use the [Mahout 
CBayes](http://mahout.apache.org/users/mapreduce/classification/bayesian.html)
+classifier to create a model that would classify a new document into one of
+the 20 newsgroups.
+
+<a name="TwentyNewsgroups-Prerequisites"></a>
+### Prerequisites
+
+* Mahout has been downloaded ([instructions 
here](https://mahout.apache.org/general/downloads.html))
+* Maven is available
+* Your environment has the following variables:
+     * **HADOOP_HOME** environment variable points to where Hadoop lives 
+     * **MAHOUT_HOME** environment variable points to where Mahout lives
+
+<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a>
+### Instructions for running the example
+
+1. If running Hadoop in cluster mode, start the hadoop daemons by executing 
the following commands:
+
+            $ cd $HADOOP_HOME/bin
+            $ ./start-all.sh
+   
+    Otherwise:
+
+            $ export MAHOUT_LOCAL=true
+
+2. In the trunk directory of Mahout, compile and install Mahout:
+
+            $ cd $MAHOUT_HOME
+            $ mvn -DskipTests clean install
+
+3. Run the [20 newsgroups example 
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
 by executing:
+
+            $ ./examples/bin/classify-20newsgroups.sh
+
+4. You will be prompted to select a classification algorithm: 
+    
+            1. Complement Naive Bayes
+            2. Naive Bayes
+            3. Stochastic Gradient Descent
+
+Select 1 and the script will perform the following:
+
+1. Create a working directory for the dataset and all input/output.
+2. Download and extract the *20news-bydate.tar.gz* from the [20 newsgroups 
dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) 
to the working directory.
+3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. 
+4. Convert and preprocess the dataset into a < Text, VectorWritable > 
SequenceFile containing term frequencies for each document.
+5. Split the preprocessed dataset into training and testing sets. 
+6. Train the classifier.
+7. Test the classifier.
+
+
+Output should look something like:
+
+
+    =======================================================
+    Confusion Matrix
+    -------------------------------------------------------
+     a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t <--Classified 
as
+    381 0  0  0  0  9  1  0  0  0  1  0  0  2  0  1  0  0  3  0 |398 
a=rec.motorcycles
+     1 284 0  0  0  0  1  0  6  3  11 0  66 3  0  6  0  4  9  0 |395 
b=comp.windows.x
+     2  0 339 2  0  3  5  1  0  0  0  0  1  1  12 1  7  0  2  0 |376 
c=talk.politics.mideast
+     4  0  1 327 0  2  2  0  0  2  1  1  0  5  1  4  12 0  2  0 |364 
d=talk.politics.guns
+     7  0  4  32 27 7  7  2  0  12 0  0  6  0 100 9  7  31 0  0 |251 
e=talk.religion.misc
+     10 0  0  0  0 359 2  2  0  0  3  0  1  6  0  1  0  0  11 0 |396 
f=rec.autos
+     0  0  0  0  0  1 383 9  1  0  0  0  0  0  0  0  0  3  0  0 |397 
g=rec.sport.baseball
+     1  0  0  0  0  0  9 382 0  0  0  0  1  1  1  0  2  0  2  0 |399 
h=rec.sport.hockey
+     2  0  0  0  0  4  3  0 330 4  4  0  5  12 0  0  2  0  12 7 |385 
i=comp.sys.mac.hardware
+     0  3  0  0  0  0  1  0  0 368 0  0  10 4  1  3  2  0  2  0 |394 
j=sci.space
+     0  0  0  0  0  3  1  0  27 2 291 0  11 25 0  0  1  0  13 18|392 
k=comp.sys.ibm.pc.hardware
+     8  0  1 109 0  6  11 4  1  18 0  98 1  3  11 10 27 1  1  0 |310 
l=talk.politics.misc
+     0  11 0  0  0  3  6  0  10 6  11 0 299 13 0  2  13 0  7  8 |389 
m=comp.graphics
+     6  0  1  0  0  4  2  0  5  2  12 0  8 321 0  4  14 0  8  6 |393 
n=sci.electronics
+     2  0  0  0  0  0  4  1  0  3  1  0  3  1 372 6  0  2  1  2 |398 
o=soc.religion.christian
+     4  0  0  1  0  2  3  3  0  4  2  0  7  12 6 342 1  0  9  0 |396 p=sci.med
+     0  1  0  1  0  1  4  0  3  0  1  0  8  4  0  2 369 0  1  1 |396 
q=sci.crypt
+     10 0  4  10 1  5  6  2  2  6  2  0  2  1 86 15 14 152 0  1 |319 
r=alt.atheism
+     4  0  0  0  0  9  1  1  8  1  12 0  3  0  2  0  0  0 341 2 |390 
s=misc.forsale
+     8  5  0  0  0  1  6  0  8  5  50 0  40 2  1  0  9  0  3 256|394 
t=comp.os.ms-windows.misc
+    =======================================================
+    Statistics
+    -------------------------------------------------------
+    Kappa                                       0.8808
+    Accuracy                                   90.8596%
+    Reliability                                86.3632%
+    Reliability (standard deviation)            0.2131
+
+
+
+
+
+<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a>
+## End to end commands to build a CBayes model for 20 newsgroups
+The [20 newsgroups example 
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
 issues the following commands as outlined above. We can build a CBayes 
classifier from the command line by following the process in the script: 
+
+*Be sure that **MAHOUT_HOME**/bin and **HADOOP_HOME**/bin are in your 
**$PATH***
+
+1. Create a working directory for the dataset and all input/output.
+           
+            $ export WORK_DIR=/tmp/mahout-work-${USER}
+            $ mkdir -p ${WORK_DIR}
+
+2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups 
dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) 
to the working directory.
+
+            $ curl 
http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz 
+                -o ${WORK_DIR}/20news-bydate.tar.gz
+            $ mkdir -p ${WORK_DIR}/20news-bydate
+            $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz 
&& cd .. && cd ..
+            $ mkdir ${WORK_DIR}/20news-all
+            $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
+     * If you're running on a Hadoop cluster:
+ 
+            $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
+
+3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. 
+          
+            $ mahout seqdirectory 
+                -i ${WORK_DIR}/20news-all 
+                -o ${WORK_DIR}/20news-seq 
+                -ow
+            
+4. Convert and preprocess the dataset into a < Text, VectorWritable > 
SequenceFile containing term frequencies for each document. 
+            
+            $ mahout seq2sparse 
+                -i ${WORK_DIR}/20news-seq 
+                -o ${WORK_DIR}/20news-vectors
+                -lnorm 
+                -nv 
+                -wt tfidf
+
+If we wanted to use different parsing methods or transformations on the term
+frequency vectors we could supply different options here, e.g. -ng 2 for
+bigrams or -n 2 for L2 length normalization. See the [Creating vectors from
+text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html)
+page for a list of all seq2sparse options.
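+
+As a purely illustrative variant (not part of the script), a bigram run would simply add the -ng flag mentioned above to the same command:
+
+            $ mahout seq2sparse 
+                -i ${WORK_DIR}/20news-seq 
+                -o ${WORK_DIR}/20news-vectors 
+                -lnorm 
+                -nv 
+                -wt tfidf 
+                -ng 2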
+
+5. Split the preprocessed dataset into training and testing sets.
+
+            $ mahout split 
+                -i ${WORK_DIR}/20news-vectors/tfidf-vectors 
+                --trainingOutput ${WORK_DIR}/20news-train-vectors 
+                --testOutput ${WORK_DIR}/20news-test-vectors  
+                --randomSelectionPct 40 
+                --overwrite --sequenceFiles -xm sequential
+ 
+6. Train the classifier.
+
+            $ mahout trainnb 
+                -i ${WORK_DIR}/20news-train-vectors
+                -el  
+                -o ${WORK_DIR}/model 
+                -li ${WORK_DIR}/labelindex 
+                -ow 
+                -c
+
+7. Test the classifier.
+
+            $ mahout testnb 
+                -i ${WORK_DIR}/20news-test-vectors
+                -m ${WORK_DIR}/model 
+                -l ${WORK_DIR}/labelindex 
+                -ow 
+                -o ${WORK_DIR}/20news-testing 
+                -c
+
+ 
+       
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/classification/wikipedia-classifier-example.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/classification/wikipedia-classifier-example.md
 
b/website-old/docs/tutorials/map-reduce/classification/wikipedia-classifier-example.md
new file mode 100644
index 0000000..ab80054
--- /dev/null
+++ 
b/website-old/docs/tutorials/map-reduce/classification/wikipedia-classifier-example.md
@@ -0,0 +1,57 @@
+---
+layout: tutorial
+title: (Deprecated)  Wikipedia XML parser and Naive Bayes Example
+theme:
+    name: retro-mahout
+---
+# Wikipedia XML parser and Naive Bayes Classifier Example
+
+## Introduction
+Mahout has an [example 
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh)
 [1] which will download a recent XML dump of the (entire if desired) [English 
Wikipedia database](http://dumps.wikimedia.org/enwiki/latest/). After running 
the classification script, you can use the [document classification 
script](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
 from the Mahout 
[spark-shell](http://mahout.apache.org/users/sparkbindings/play-with-shell.html)
to vectorize and classify text from outside of the training and testing corpus
using a model built on the Wikipedia dataset.
+
+You can run this script to build and test a Naive Bayes classifier for option 
(1) 10 arbitrary countries or option (2) 2 countries (United States and United 
Kingdom).
+
+## Overview
+
+To run the example, simply execute the `$MAHOUT_HOME/examples/bin/classify-wikipedia.sh` script.
+
+By default the script is set to run on a medium-sized Wikipedia XML dump. To
+run on the full set (the entire English Wikipedia) you can change the download
+by commenting out line 78 and uncommenting line 80 of
+[classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh) [1].
+However, this is not recommended unless you have the resources to do so. *Be
+sure to clean your work directory when changing datasets - option (3).*
+
+The step-by-step process for creating a Naive Bayes classifier for the
+Wikipedia XML dump is very similar to that for [creating a 20 Newsgroups
+classifier](http://mahout.apache.org/users/classification/twenty-newsgroups.html) [4].
+The only difference is that instead of running `$ mahout seqdirectory` on the
+unzipped 20 Newsgroups file, you'll run `$ mahout seqwiki` on the unzipped
+Wikipedia XML dump.
+
+    $ mahout seqwiki 
+
+The above command launches `WikipediaToSequenceFile.java`, which accepts a text
+file of categories [3] and starts an MR job to parse each document in the
+XML file. This process will seek to extract documents with a Wikipedia
+category tag which (exactly, if the `-exactMatchOnly` option is set) matches a
+line in the category file. If no match is found and the `-all` option is set,
+the document will be dumped into an "unknown" category. The documents will then
+be written out as a `<Text,Text>` sequence file of the form
+(K: /category/document_title, V: document).
+
+There are 3 different example category files available in the /examples/src/test/resources
+directory: country.txt, country10.txt and country2.txt. You can edit these
+categories to extract a different corpus from the Wikipedia dataset.
+
+The CLI options for `seqwiki` are as follows:
+
+    --input          (-i)         input pathname String
+    --output         (-o)         the output pathname String
+    --categories     (-c)         the file containing the Wikipedia categories
+    --exactMatchOnly (-e)         if set, then the Wikipedia category must 
match
+                                    exactly instead of simply containing the 
category string
+    --all            (-all)       if set select all categories
+    --removeLabels   (-rl)        if set, remove [[Category:labels]] from 
document text after extracting label.
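+
+Putting these options together, an illustrative invocation might look like the following; the input path is a placeholder for wherever you stored the XML dump, and the category file is the bundled country10.txt referenced above:
+
+    # illustrative only -- substitute your own dump location and output directory
+    $ mahout seqwiki 
+        -i /path/to/enwiki-latest-pages-articles.xml 
+        -o /path/to/wikipedia-seqfiles 
+        -c $MAHOUT_HOME/examples/src/test/resources/country10.txt 
+        -all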
+
+
+After `seqwiki`, the script runs `seq2sparse`, `split`, `trainnb` and `testnb` 
as in the [step by step 20newsgroups 
example](http://mahout.apache.org/users/classification/twenty-newsgroups.html). 
 When all of the jobs have finished, a confusion matrix will be displayed.
+
+# Resources
+
+[1] 
[classify-wikipedia.sh](https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh)
+
+[2] [Document classification script for the Mahout Spark 
Shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
+
+[3] [Example category 
file](https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt)
+
+[4] [Step by step instructions for building a Naive Bayes classifier for 
20newsgroups from the command 
line](http://mahout.apache.org/users/classification/twenty-newsgroups.html)
+
+[5] [Mahout MapReduce Naive 
Bayes](http://mahout.apache.org/users/classification/bayesian.html)
+
+[6] [Mahout Spark Naive 
Bayes](http://mahout.apache.org/users/algorithms/spark-naive-bayes.html)
+
+[7] [Mahout Scala Spark and H2O 
Bindings](http://mahout.apache.org/users/sparkbindings/home.html)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/20newsgroups.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/20newsgroups.md 
b/website-old/docs/tutorials/map-reduce/clustering/20newsgroups.md
new file mode 100644
index 0000000..e39d989
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/20newsgroups.md
@@ -0,0 +1,11 @@
+---
+layout: tutorial
+title: (Deprecated)  20Newsgroups
+theme:
+   name: retro-mahout
+---
+
+<a name="20Newsgroups-NaiveBayesusing20NewsgroupsData"></a>
+# Naive Bayes using 20 Newsgroups Data
+
+See 
[https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9)

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/canopy-commandline.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/canopy-commandline.md 
b/website-old/docs/tutorials/map-reduce/clustering/canopy-commandline.md
new file mode 100644
index 0000000..e7f2b21
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/canopy-commandline.md
@@ -0,0 +1,70 @@
+---
+layout: tutorial
+title: (Deprecated)  canopy-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a>
+# Running Canopy Clustering from the Command Line
+Mahout's Canopy clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run Canopy on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout canopy <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job
+
+
+<a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout canopy -i testdata -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
+
+
+<a name="canopy-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout canopy -i testdata -o output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="canopy-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                         Path to job input directory.
+                                                 Must be a SequenceFile of
+                                                 VectorWritable
+      --output (-o) output                       The directory pathname for output.
+      --overwrite (-ow)                          If present, overwrite the output
+                                                 directory before running job
+      --distanceMeasure (-dm) distanceMeasure    The classname of the
+                                                 DistanceMeasure. Default is
+                                                 SquaredEuclidean
+      --t1 (-t1) t1                              T1 threshold value
+      --t2 (-t2) t2                              T2 threshold value
+      --clustering (-cl)                         If present, run clustering after
+                                                 the iterations have taken place
+      --help (-h)                                Print out help
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
 
b/website-old/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
new file mode 100644
index 0000000..201e9d8
--- /dev/null
+++ 
b/website-old/docs/tutorials/map-reduce/clustering/clustering-of-synthetic-control-data.md
@@ -0,0 +1,53 @@
+---
+layout: tutorial
+title: (Deprecated)  Clustering of synthetic control data
+theme:
+   name: retro-mahout
+---
+
+# Clustering synthetic control data
+
+## Introduction
+
+This example will demonstrate clustering of time series data, specifically 
control charts. [Control charts](http://en.wikipedia.org/wiki/Control_chart) 
are tools used to determine whether a manufacturing or business process is in a 
state of statistical control. Such control charts are generated / simulated 
repeatedly at equal time intervals. A [simulated 
dataset](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html)
is available in the UCI Machine Learning Repository.
+
+A time series of control charts needs to be clustered into close-knit groups.
+The data set we use is synthetic and is meant to resemble real-world
information in an anonymized format. It contains six different classes: Normal, 
Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift. In 
this example we will use Mahout to cluster the data into corresponding class 
buckets. 
+
+*For the sake of simplicity, we won't use a cluster in this example, but 
instead show you the commands to run the clustering examples locally with 
Hadoop*.
+
+## Setup
+
+We need to do some initial setup before we are able to run the example. 
+
+
+  1. Start out by downloading the dataset to be clustered from the UCI Machine 
Learning Repository: 
[http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data).
+
+  2. Download the [latest release of Mahout](/general/downloads.html).
+
+  3. Unpack the release binary and switch to the *mahout-distribution-0.x* 
folder
+
+  4. Make sure that the *JAVA_HOME* environment variable points to your local 
java installation
+
+  5. Create a folder called *testdata* in the current directory and copy the 
dataset into this folder.
+
+
+## Clustering Examples
+
+Depending on the clustering algorithm you want to run, the following commands 
can be used:
+
+
+   * [Canopy Clustering](/users/clustering/canopy-clustering.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
+
+   * [k-Means Clustering](/users/clustering/k-means-clustering.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
+
+
+   * [Fuzzy k-Means Clustering](/users/clustering/fuzzy-k-means.html)
+
+    bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
+
+The clustering output will be produced in the *output* directory. The output 
data points are in vector format. In order to read/analyze the output, you can 
use the [clusterdump](/users/clustering/cluster-dumper.html) utility provided 
by Mahout.
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
 
b/website-old/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
new file mode 100644
index 0000000..35b2008
--- /dev/null
+++ 
b/website-old/docs/tutorials/map-reduce/clustering/clustering-seinfeld-episodes.md
@@ -0,0 +1,11 @@
+---
+layout: tutorial
+title: (Deprecated)  Clustering Seinfeld Episodes
+theme:
+   name: retro-mahout
+---
+
+Below is a short tutorial on how to cluster Seinfeld episode transcripts with
+Mahout.
+
+http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/clusteringyourdata.md 
b/website-old/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
new file mode 100644
index 0000000..ba7cb0b
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/clusteringyourdata.md
@@ -0,0 +1,126 @@
+---
+layout: tutorial
+title: (Deprecated)  ClusteringYourData
+theme:
+   name: retro-mahout
+---
+
+# Clustering your data
+
+After you've done the [Quickstart](quickstart.html) and are familiar with the 
basics of Mahout, it is time to cluster your own
+data. See also [Wikipedia on cluster analysis](http://en.wikipedia.org/wiki/Cluster_analysis) for more background.
+
+The following pieces *may* be useful in getting started:
+
+<a name="ClusteringYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format, see 
[Creating Vectors](../basics/creating-vectors.html).
+In particular for text preparation check out [Creating Vectors from 
Text](../basics/creating-vectors-from-text.html).
+
+
+<a name="ClusteringYourData-RunningtheProcess"></a>
+# Running the Process
+
+* [Canopy background](canopy-clustering.html) and 
[canopy-commandline](canopy-commandline.html).
+
+* [K-Means background](k-means-clustering.html), 
[k-means-commandline](k-means-commandline.html), and
+[fuzzy-k-means-commandline](fuzzy-k-means-commandline.html).
+
+* [Dirichlet background](dirichlet-process-clustering.html) and 
[dirichlet-commandline](dirichlet-commandline.html).
+
+* [Meanshift background](mean-shift-clustering.html) and 
[mean-shift-commandline](mean-shift-commandline.html).
+
+* [LDA (Latent Dirichlet Allocation) 
background](latent-dirichlet-allocation.html) and 
[lda-commandline](lda-commandline.html).
+
+* TODO: kmeans++/ streaming kMeans documentation
+
+
+<a name="ClusteringYourData-RetrievingtheOutput"></a>
+# Retrieving the Output
+
+Mahout has a cluster dumper utility that can be used to retrieve and evaluate 
your clustering data.
+
+    ./bin/mahout clusterdump <OPTIONS>
+
+
+<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
+## The cluster dumper options are:
+
+      --help (-h)                                 Print out help       
+           
+      --input (-i) input                          The directory containing 
Sequence    
+                                          Files for the Clusters           
+
+      --output (-o) output                        The output file.  If not 
specified,  
+                                          dumps to the console.
+
+      --outputFormat (-of) outputFormat           The optional output format 
to write
+                                          the results as. Options: TEXT, CSV, 
or GRAPH_ML               
+
+      --substring (-b) substring                  The number of chars of the   
    
+                                          asFormatString() to print    
+    
+      --pointsDir (-p) pointsDir                  The directory containing 
points  
+                                          sequence files mapping input vectors 
                                           to their cluster.  If specified, 
+                                          then the program will output the 
+                                          points associated with a cluster 
+
+      --dictionary (-d) dictionary                The dictionary file.         
    
+
+      --dictionaryType (-dt) dictionaryType    The dictionary file type        
    
+                                          (text|sequencefile)
+
+      --distanceMeasure (-dm) distanceMeasure  The classname of the 
DistanceMeasure.
+                                          Default is SquaredEuclidean.     
+
+      --numWords (-n) numWords            The number of top terms to print 
+
+      --tempDir tempDir                           Intermediate output directory
+
+      --startPhase startPhase             First phase to run
+
+      --endPhase endPhase                         Last phase to run
+
+      --evaluate (-e)                     Run ClusterEvaluator and 
CDbwEvaluator over the
+                                          input. The output will be appended 
to the rest of
+                                          the output at the end.   
+
+
+More information on using the clusterdump utility can be found [here](cluster-dumper.html).
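+
+For instance, a dump of the clusters together with their top terms and associated points might look like the following; the angle-bracketed paths are placeholders for wherever your clustering job wrote its clusters, points and dictionary:
+
+    # illustrative only -- adjust the paths to your own job's output
+    ./bin/mahout clusterdump 
+        -i <path to clusters-N-final> 
+        -p <path to clusteredPoints> 
+        -d <path to dictionary> 
+        -dt sequencefile 
+        -n 20 
+        -o cluster-analysis.txt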
+
+<a name="ClusteringYourData-ValidatingtheOutput"></a>
+# Validating the Output
+
+Ted Dunning: A principled approach to cluster evaluation is to measure how 
well the
+cluster membership captures the structure of unseen data.  A natural
+measure for this is to measure how much of the entropy of the data is
+captured by cluster membership.  For k-means and its natural L_2 metric,
+the natural cluster quality metric is the squared distance from the nearest
+centroid adjusted by the log_2 of the number of clusters.  This can be
+compared to the squared magnitude of the original data or the squared
+deviation from the centroid for all of the data.  The idea is that you are
+changing the representation of the data by allocating some of the bits in
+your original representation to represent which cluster each point is in. 
+If those bits aren't made up by the residue being small then your
+clustering is making a bad trade-off.
+
+In the past, I have used other more heuristic measures as well.  One of the
+key characteristics that I would like to see out of a clustering is a
+degree of stability.  Thus, I look at the fractions of points that are
+assigned to each cluster or the distribution of distances from the cluster
+centroid. These values should be relatively stable when applied to held-out
+data.
+
+For text, you can actually compute perplexity which measures how well
+cluster membership predicts what words are used.  This is nice because you
+don't have to worry about the entropy of real valued numbers.
+
+Manual inspection and the so-called laugh test is also important.  The idea
+is that the results should not be so ludicrous as to make you laugh.
+Unfortunately, it is pretty easy to kid yourself into thinking your system
+is working using this kind of inspection.  The problem is that we are too
+good at seeing (making up) patterns.
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md 
b/website-old/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
new file mode 100644
index 0000000..721256d
--- /dev/null
+++ 
b/website-old/docs/tutorials/map-reduce/clustering/fuzzy-k-means-commandline.md
@@ -0,0 +1,97 @@
+---
+layout: tutorial
+title: (Deprecated)  fuzzy-k-means-commandline
+theme:
+   name: retro-mahout
+---
+
+<a 
name="fuzzy-k-means-commandline-RunningFuzzyk-MeansClusteringfromtheCommandLine"></a>
+# Running Fuzzy k-Means Clustering from the Command Line
+Mahout's Fuzzy k-Means clustering can be launched from the same command
+line invocation whether you are running on a single machine in stand-alone
+mode or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run FuzzyK on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout fkmeans <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job
+
+
+<a name="fuzzy-k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+
+<a name="fuzzy-k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout fkmeans -i testdata <OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="fuzzy-k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                              Path to job input 
directory. 
+                                              Must be a SequenceFile of    
+                                              VectorWritable               
+      --clusters (-c) clusters                The input centroids, as Vectors. 
+                                              Must be a SequenceFile of    
+                                              Writable, Cluster/Canopy. If k  
+                                              is also specified, then a random 
+                                              set of vectors will be selected  
+                                              and written out to this path 
+                                              first                        
+      --output (-o) output                            The directory pathname 
for   
+                                              output.                      
+      --distanceMeasure (-dm) distanceMeasure      The classname of the        
    
+                                              DistanceMeasure. Default is  
+                                              SquaredEuclidean             
+      --convergenceDelta (-cd) convergenceDelta    The convergence delta 
value. 
+                                              Default is 0.5               
+      --maxIter (-x) maxIter                  The maximum number of        
+                                              iterations.                  
+      --k (-k) k                                      The k in k-Means.  If 
specified, 
+                                              then a random selection of k 
+                                              Vectors will be chosen as the
+                                                      Centroid and written to 
the  
+                                              clusters input path.         
+      --m (-m) m                                      coefficient 
normalization    
+                                              factor, must be greater than 1   
+      --overwrite (-ow)                               If present, overwrite 
the output 
+                                              directory before running job 
+      --help (-h)                                     Print out help           
    
+      --numMap (-u) numMap                            The number of map tasks. 
    
+                                              Defaults to 10               
+      --maxRed (-r) maxRed                            The number of reduce 
tasks.  
+                                              Defaults to 2                
+      --emitMostLikely (-e) emitMostLikely            True if clustering 
should emit   
+                                              the most likely point only,  
+                                              false for threshold clustering.  
+                                              Default is true              
+      --threshold (-t) threshold                      The pdf threshold used 
for   
+                                              cluster determination. Default   
+                                              is 0 
+      --clustering (-cl)                              If present, run 
clustering after 
+                                              the iterations have taken place  
+                                
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/k-means-commandline.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/k-means-commandline.md 
b/website-old/docs/tutorials/map-reduce/clustering/k-means-commandline.md
new file mode 100644
index 0000000..b9ac430
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/k-means-commandline.md
@@ -0,0 +1,94 @@
+---
+layout: tutorial
+title: (Deprecated)  k-means-commandline
+theme:
+   name: retro-mahout
+---
+
+<a name="k-means-commandline-Introduction"></a>
+# kMeans commandline introduction
+
+This quick start page describes how to run the kMeans clustering algorithm
+on a Hadoop cluster. 
+
+<a name="k-means-commandline-Steps"></a>
+# Steps
+
+Mahout's k-Means clustering can be launched from the same command line
+invocation whether you are running on a single machine in stand-alone mode
+or on a larger Hadoop cluster. The difference is determined by the
+$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to
+an operating Hadoop cluster on the target machine then the invocation will
+run k-Means on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+    ./bin/mahout kmeans <OPTIONS>
+
+
+In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job
+
+
+<a name="k-means-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
+
+
+<a name="k-means-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout kmeans -i testdata -o output -c clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="k-means-commandline-Commandlineoptions"></a>
+# Command line options
+
+      --input (-i) input                              Path to job input 
directory. 
+                                              Must be a SequenceFile of    
+                                              VectorWritable               
+      --clusters (-c) clusters                The input centroids, as Vectors. 
+                                              Must be a SequenceFile of    
+                                              Writable, Cluster/Canopy. If k  
+                                              is also specified, then a random 
+                                              set of vectors will be selected  
+                                              and written out to this path 
+                                              first                        
+      --output (-o) output                            The directory pathname 
for   
+                                              output.                      
+      --distanceMeasure (-dm) distanceMeasure      The classname of the        
    
+                                              DistanceMeasure. Default is  
+                                              SquaredEuclidean             
+      --convergenceDelta (-cd) convergenceDelta    The convergence delta 
value. 
+                                              Default is 0.5               
+      --maxIter (-x) maxIter                  The maximum number of        
+                                              iterations.                  
+      --maxRed (-r) maxRed                            The number of reduce 
tasks.  
+                                              Defaults to 2                
+      --k (-k) k                                      The k in k-Means.  If 
specified, 
+                                              then a random selection of k 
+                                              Vectors will be chosen as the    
+                                              Centroid and written to the  
+                                              clusters input path.         
+      --overwrite (-ow)                               If present, overwrite 
the output 
+                                              directory before running job 
+      --help (-h)                                     Print out help           
    
+      --clustering (-cl)                              If present, run 
clustering after 
+                                              the iterations have taken place  
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/lda-commandline.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/lda-commandline.md 
b/website-old/docs/tutorials/map-reduce/clustering/lda-commandline.md
new file mode 100644
index 0000000..6b10681
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/lda-commandline.md
@@ -0,0 +1,83 @@
+---
+layout: tutorial
+title: (Deprecated)  lda-commandline
+theme:
+   name: retro-mahout
+---
+
+<a 
name="lda-commandline-RunningLatentDirichletAllocation(algorithm)fromtheCommandLine"></a>
+# Running Latent Dirichlet Allocation (algorithm) from the Command Line
+[Since Mahout v0.6](https://issues.apache.org/jira/browse/MAHOUT-897),
+LDA has been implemented as Collapsed Variational Bayes (cvb).
+
+Mahout's LDA can be launched from the same command line invocation whether
+you are running on a single machine in stand-alone mode or on a larger
+Hadoop cluster. The difference is determined by the $HADOOP_HOME and
+$HADOOP_CONF_DIR environment variables. If both are set to an operating
+Hadoop cluster on the target machine then the invocation will run the LDA
+algorithm on that cluster. If either of the environment variables are
+missing then the stand-alone Hadoop configuration will be invoked instead.
+
+
+
+    ./bin/mahout cvb <OPTIONS>
+
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.3 release, the
+job will be mahout-core-0.3.job
+
+
+<a name="lda-commandline-Testingitononesinglemachinew/ocluster"></a>
+## Testing it on one single machine w/o cluster
+
+* Put the data: cp <PATH TO DATA> testdata
+* Run the Job: 
+
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+
+<a name="lda-commandline-Runningitonthecluster"></a>
+## Running it on the cluster
+
+* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+* Run the Job: 
+
+    export HADOOP_HOME=<Hadoop Home Directory>
+    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
+    ./bin/mahout cvb -i testdata <OTHER OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a>
+# Command line options from Mahout cvb version 0.8
+
+    mahout cvb -h 
+      --input (-i) input                                         Path to job 
input directory.        
+      --output (-o) output                                       The directory 
pathname for output.  
+      --maxIter (-x) maxIter                             The maximum number of 
iterations.             
+      --convergenceDelta (-cd) convergenceDelta                  The 
convergence delta value               
+      --overwrite (-ow)                                          If present, 
overwrite the output directory before running job    
+      --num_topics (-k) num_topics                               Number of 
topics to learn              
+      --num_terms (-nt) num_terms                                Vocabulary 
size   
+      --doc_topic_smoothing (-a) doc_topic_smoothing     Smoothing for 
document/topic distribution          
+      --term_topic_smoothing (-e) term_topic_smoothing   Smoothing for 
topic/term distribution          
+      --dictionary (-dict) dictionary                    Path to 
term-dictionary file(s) (glob expression supported) 
+      --doc_topic_output (-dt) doc_topic_output                  Output path 
for the training doc/topic distribution        
+      --topic_model_temp_dir (-mt) topic_model_temp_dir          Path to 
intermediate model path (useful for restarting)       
+      --iteration_block_size (-block) iteration_block_size       Number of 
iterations per perplexity check  
+      --random_seed (-seed) random_seed                          Random seed   
    
+      --test_set_fraction (-tf) test_set_fraction                Fraction of 
data to hold out for testing  
+      --num_train_threads (-ntt) num_train_threads               number of 
threads per mapper to train with  
+      --num_update_threads (-nut) num_update_threads     number of threads per 
mapper to update the model with        
+      --max_doc_topic_iters (-mipd) max_doc_topic_iters          max number of 
iterations per doc for p(topic|doc) learning              
+      --num_reduce_tasks num_reduce_tasks                        number of 
reducers to use during model estimation        
+      --backfill_perplexity                              enable backfilling of 
missing perplexity values               
+      --help (-h)                                                Print out 
help    
+      --tempDir tempDir                                          Intermediate 
output directory      
+      --startPhase startPhase                            First phase to run    
+      --endPhase endPhase                                        Last phase to 
run
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/viewing-result.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/clustering/viewing-result.md 
b/website-old/docs/tutorials/map-reduce/clustering/viewing-result.md
new file mode 100644
index 0000000..ce9dd91
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/viewing-result.md
@@ -0,0 +1,15 @@
+---
+layout: tutorial
+title: (Deprecated)  Viewing Result
+theme:
+   name: retro-mahout
+---
+* [Algorithm Viewing pages](#ViewingResult-AlgorithmViewingpages)
+
+There are various technologies available to view the output of Mahout
+algorithms.
+* Clusters
+
+<a name="ViewingResult-AlgorithmViewingpages"></a>
+# Algorithm Viewing pages

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/viewing-results.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/viewing-results.md 
b/website-old/docs/tutorials/map-reduce/clustering/viewing-results.md
new file mode 100644
index 0000000..1b5092f
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/clustering/viewing-results.md
@@ -0,0 +1,49 @@
+---
+layout: tutorial
+title: (Deprecated)  Viewing Results
+theme:
+   name: retro-mahout
+---
+<a name="ViewingResults-Intro"></a>
+# Intro
+
+Many of the Mahout libraries run as batch jobs, dumping results into Hadoop
+sequence files or other data structures.  This page is intended to
+demonstrate the various ways one might inspect the outcome of these jobs.
+The page is organized by algorithm.
+
+<a name="ViewingResults-GeneralUtilities"></a>
+# General Utilities
+
+<a name="ViewingResults-SequenceFileDumper"></a>
+## Sequence File Dumper
+
+
+<a name="ViewingResults-Clustering"></a>
+# Clustering
+
+<a name="ViewingResults-ClusterDumper"></a>
+## Cluster Dumper
+
+Run the following to print out all options:
+
+    java  -cp "*" org.apache.mahout.utils.clustering.ClusterDumper --help
+
+
+
+<a name="ViewingResults-Example"></a>
+### Example
+
+    java -cp "*" org.apache.mahout.utils.clustering.ClusterDumper 
+          --seqFileDir ./solr-clust-n2/out/clusters-2 
+          --dictionary ./solr-clust-n2/dictionary.txt 
+          --substring 100 --pointsDir ./solr-clust-n2/out/points/
+
+
+
+<a name="ViewingResults-ClusterLabels(MAHOUT-163)"></a>
+## Cluster Labels (MAHOUT-163)
+
+<a name="ViewingResults-Classification"></a>
+# Classification

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
 
b/website-old/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
new file mode 100644
index 0000000..fe4b93f
--- /dev/null
+++ 
b/website-old/docs/tutorials/map-reduce/clustering/visualizing-sample-clusters.md
@@ -0,0 +1,50 @@
+---
+layout: tutorial
+title: (Deprecated)  Visualizing Sample Clusters
+theme:
+   name: retro-mahout
+---
+
+<a name="VisualizingSampleClusters-Introduction"></a>
+# Introduction
+
+Mahout provides examples to visualize the sample clusters that get created by
+our clustering algorithms. Note that the visualization is done by Swing programs: you have to be in a window system on the same
+machine you run these on, or logged in via a remote desktop.
+
+For visualizing the clusters, you have to execute the Java
+classes under the *org.apache.mahout.clustering.display* package in the
+mahout-examples module. The easiest way to achieve this is to [set up
+Mahout](users/basics/quickstart.html) in your IDE.
+
+<a name="VisualizingSampleClusters-Visualizingclusters"></a>
+# Visualizing clusters
+
+The following classes in *org.apache.mahout.clustering.display* can be run
+without parameters to generate a sample data set and run the reference
+clustering implementations over them:
+
+1. **DisplayClustering** - generates 1000 samples from three symmetric
+distributions. This is the same data set that is used by the following
+clustering programs. It displays the points on a screen and superimposes
+the model parameters that were used to generate the points. You can edit
+the *generateSamples()* method to change the sample points used by these
+programs.
+1. **DisplayClustering** - displays initial areas of generated points
+1. **DisplayCanopy** - uses Canopy clustering
+1. **DisplayKMeans** - uses k-Means clustering
+1. **DisplayFuzzyKMeans** - uses Fuzzy k-Means clustering
+1. **DisplaySpectralKMeans** - uses the map-reduce Spectral k-Means algorithm
+
+If you are using Eclipse, just right-click on each of the classes mentioned
+above and choose "Run As > Java Application". To run these directly from the
+command line:
+
+    cd $MAHOUT_HOME/examples
+    mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayClustering
+
+You can substitute other names above for *DisplayClustering*. 
+
+
+Note that some of these programs display the sample points and then
+superimpose all of the clusters from each iteration. The last iteration's
+clusters are in bold red and the previous several are colored (orange, yellow,
+green, blue, magenta) in order; all earlier clusters are shown in light grey.
+This helps to visualize how the clusters converge upon a solution over
+multiple iterations.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/index.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/index.md 
b/website-old/docs/tutorials/map-reduce/index.md
new file mode 100644
index 0000000..b1a269c
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/index.md
@@ -0,0 +1,19 @@
+---
+layout: tutorial
+title: (Deprecated)  Deprecated Map Reduce Based Examples
+theme:
+    name: mahout2
+---
+
+Note: support for the MapReduce-based implementations is being sunset; the
+examples below are deprecated and are kept for reference only.
+
+
+### Classification
+
+[Bank Marketing Example](classification/bankmarketing-example.html)
+
+[Breiman Example](classification/breiman-example.html)
+
+[Twenty Newsgroups](classification/twenty-newsgroups.html)
+
+[Wikipedia Classifier Example](classification/wikipedia-classifier-example.html)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/mr---map-reduce.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/misc/mr---map-reduce.md 
b/website-old/docs/tutorials/map-reduce/misc/mr---map-reduce.md
new file mode 100644
index 0000000..a85d0cd
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/misc/mr---map-reduce.md
@@ -0,0 +1,19 @@
+---
+layout: default
+title: (Deprecated)  MR - Map Reduce
+theme:
+   name: retro-mahout
+---
+
+MapReduce is a framework for processing huge datasets on certain kinds of
+distributable problems using a large number of computers (nodes),
+collectively referred to as a cluster. Computational processing can occur on
+data stored either in a filesystem (unstructured) or within a database
+(structured).
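+
+As an illustration of the programming model, here is a sketch of the canonical
+word-count job written against Hadoop's Java MapReduce API (this is the
+standard Hadoop example, shown for illustration only; it is not Mahout code):
+
+    import java.io.IOException;
+    import java.util.StringTokenizer;
+    import org.apache.hadoop.io.IntWritable;
+    import org.apache.hadoop.io.Text;
+    import org.apache.hadoop.mapreduce.Mapper;
+    import org.apache.hadoop.mapreduce.Reducer;
+
+    public class WordCount {
+      // map: emit (word, 1) for every token in an input line
+      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
+        private static final IntWritable ONE = new IntWritable(1);
+        private final Text word = new Text();
+        public void map(Object key, Text value, Context context)
+            throws IOException, InterruptedException {
+          StringTokenizer itr = new StringTokenizer(value.toString());
+          while (itr.hasMoreTokens()) {
+            word.set(itr.nextToken());
+            context.write(word, ONE);
+          }
+        }
+      }
+
+      // reduce: sum the per-word counts emitted by the mappers
+      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
+        private final IntWritable result = new IntWritable();
+        public void reduce(Text key, Iterable<IntWritable> values, Context context)
+            throws IOException, InterruptedException {
+          int sum = 0;
+          for (IntWritable val : values) {
+            sum += val.get();
+          }
+          result.set(sum);
+          context.write(key, result);
+        }
+      }
+    }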
+
+Also written M/R.
+
+
+See also:
+* 
[http://wiki.apache.org/hadoop/HadoopMapReduce](http://wiki.apache.org/hadoop/HadoopMapReduce)
+* 
[http://en.wikipedia.org/wiki/MapReduce](http://en.wikipedia.org/wiki/MapReduce)

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
 
b/website-old/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
new file mode 100644
index 0000000..213c385
--- /dev/null
+++ 
b/website-old/docs/tutorials/map-reduce/misc/parallel-frequent-pattern-mining.md
@@ -0,0 +1,185 @@
+---
+layout: default
+title: (Deprecated)  Parallel Frequent Pattern Mining
+theme:
+    name: retro-mahout
+---
+Mahout has a Top-K Parallel FPGrowth implementation. It is based on the paper
+[http://infolab.stanford.edu/~echang/recsys08-69.pdf](http://infolab.stanford.edu/~echang/recsys08-69.pdf),
+with some optimisations in mining the data.
+
+Given a huge transaction list, the algorithm finds all unique features (sets
+of field values) and eliminates those features whose frequency in the whole
+dataset is less than minSupport. Using the N remaining features, it finds
+the top K closed patterns for each of them, generating a total of NxK
+patterns. The FPGrowth algorithm is a generic implementation; any object type
+can be used to denote a feature. The current implementation requires you to
+use a String as the object type. You may implement a version for any object
+by creating Iterators, Convertors and TopKPatternWritable for that particular
+object. For more information please refer to the package
+org.apache.mahout.fpm.pfpgrowth.convertors.string.
+
+For example:
+
+    FPGrowth<String> fp = new FPGrowth<String>();
+    Set<String> features = new HashSet<String>();
+    fp.generateTopKStringFrequentPatterns(
+        new StringRecordIterator(
+            new FileLineIterable(new File(input), encoding, false), pattern),
+        fp.generateFList(
+            new StringRecordIterator(
+                new FileLineIterable(new File(input), encoding, false), pattern),
+            minSupport),
+        minSupport,
+        maxHeapSize,
+        features,
+        new StringOutputConvertor(
+            new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer)));
+
+* The first argument is the iterator of transactions; in this case it is an
+Iterator<List<String>>.
+* The second argument is the output of the generateFList function, which
+returns the frequent items and their frequencies from the given database
+transaction iterator.
+* The third argument is the minimum support of the patterns to be generated.
+* The fourth argument is the maximum number of patterns to be mined for
+each feature.
+* The fifth argument is the set of features for which the frequent patterns
+have to be mined.
+* The last argument is an output collector which takes \[key, value\] pairs
+of feature and top-K patterns in the format \[String,
+List<Pair<List<String>, Long>>\] and writes them to the appropriate writer
+class, which takes care of storing the object, in this case in a
+SequenceFile output format.
+
+<a 
name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a>
+## Running Frequent Pattern Growth via command line
+
+The command line launcher for string transaction data,
+org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features including
+specifying the regex pattern for splitting a string line of a transaction
+into its constituent features.
+
+Input files have to be in the following format:
+
+    <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE....
+
+Instead of a tab you could use `,` or `|`, as the default tokenization is done
+using the Java regex pattern `[,\t]*[,|\t][ ,\t]*`.
+You can override this parameter to parse your log files or transaction
+files (each line is a transaction). The FPGrowth algorithm mines the top K
+frequently occurring sets of items and their counts from the given input
+data.
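+
+For instance, a tiny, made-up transaction file in this format (with a literal
+tab written here as <TAB>) might look like:
+
+    1<TAB>milk bread butter
+    2<TAB>bread beer
+    3<TAB>milk bread beer diapers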
+
+$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this
+format.
+Another sample file is accidents.dat.gz from
+[http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/).
+As a quick test, try this:
+
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+The minimumSupport parameter \-s is the minimum number of times a pattern
+or a feature needs to occur in the dataset so that it is included in the
+patterns generated. You can speed up the process by using a large value of
+s. There are cases where you will get fewer than k patterns for a
+particular feature, as the rest don't qualify for the minimum support
+criterion.
+
+Note that the input to the algorithm could be an uncompressed file, a
+compressed gz file, or even a directory containing any number of such files.
+We modified the regex to use a space to split the tokens. Note that the input
+regex string is escaped.
+
+<a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a>
+## Running Parallel FPGrowth
+
+Running parallel FPGrowth is as easy as changing the flag to \-method
+mapreduce and adding the number-of-groups parameter, e.g. \-g 20 for 20
+groups. First, let's run the above sample test in map-reduce mode:
+
+    bin/mahout fpg \
+         -i core/src/test/resources/retail.dat \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in
+sequential mode (with 5 GB of RAM allocated). In a separate test,
+the first 1000 lines of retail.dat took 20 seconds in map-reduce mode vs. 30
+seconds in sequential mode.
+
+Here is another dataset which, while several times larger, requires much
+less time to find frequent patterns, as there are very few. Get
+accidents.dat.gz from
+[http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/)
+and place it on HDFS in a folder named accidents. Then, run the
+Hadoop version of the FPGrowth job:
+
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -regex '[\ ]' \
+         -s 2
+
+
+Or, to run a dataset of this size in sequential mode on a single machine,
+let's give Mahout a lot more memory and only keep features with more than
+300 members:
+
+    export MAHOUT_HEAPSIZE=-Xmx5000m
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method sequential \
+         -regex '[\ ]' \
+         -s 2
+
+
+
+The numGroups parameter \-g in FPGrowthJob specifies the number of groups
+into which transactions have to be decomposed. The default of 1000 works
+very well on a single-machine cluster; this may be very different on large
+clusters.
+
+Note that accidents.dat has 340 unique features. So we chose \-g 10 to
+split the transactions across 10 shards, so that patterns for roughly 34
+features are mined in each shard. (Note: g doesn't need to divide the feature
+count exactly; the algorithm takes care of calculating the split.) For better
+performance on large datasets and clusters, try not to mine more than 20-25
+features per shard. Stick to the defaults on a small machine.
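+
+For example, the accidents run shown earlier can be repeated with the group
+count made explicit (this simply combines flags already introduced above):
+
+    bin/mahout fpg \
+         -i accidents \
+         -o patterns \
+         -k 50 \
+         -method mapreduce \
+         -g 10 \
+         -regex '[\ ]' \
+         -s 2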
+
+The numTreeCacheEntries parameter \-tc specifies the number of generated
+conditional FP-Trees to be kept in memory so that subsequent operations do
+not need to regenerate them. Increasing this number increases memory
+consumption but might improve speed up to a certain point. This depends
+entirely on the dataset in question. A value of 5-10 is recommended for
+mining up to the top 100 patterns for each feature.
+
+<a name="ParallelFrequentPatternMining-Viewingtheresults"></a>
+## Viewing the results
+The output will be dumped to a SequenceFile in the frequentpatterns
+directory in Text=>TopKStringPatterns format. Run this command to see a few
+of the Frequent Patterns:
+
+    bin/mahout seqdumper \
+         -i patterns/frequentpatterns/part-?-00000 \
+         -n 4
+
+or replace -n 4 with -c for the count of patterns.
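+
+If you prefer to consume the patterns programmatically rather than via
+seqdumper, a minimal Java sketch along these lines can read the
+Text => TopKStringPatterns sequence files directly. It assumes Hadoop and the
+Mahout FPGrowth classes are on the classpath, that a part file path is passed
+as the first argument, and that TopKStringPatterns exposes its pairs via a
+getPatterns() accessor; the class name PrintFrequentPatterns is just for
+illustration:
+
+    import java.util.List;
+    import org.apache.hadoop.conf.Configuration;
+    import org.apache.hadoop.fs.FileSystem;
+    import org.apache.hadoop.fs.Path;
+    import org.apache.hadoop.io.SequenceFile;
+    import org.apache.hadoop.io.Text;
+    import org.apache.mahout.common.Pair;
+    import org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns;
+
+    public class PrintFrequentPatterns {
+      public static void main(String[] args) throws Exception {
+        Configuration conf = new Configuration();
+        FileSystem fs = FileSystem.get(conf);
+        // one of the part files under patterns/frequentpatterns
+        Path path = new Path(args[0]);
+        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
+        Text feature = new Text();
+        TopKStringPatterns patterns = new TopKStringPatterns();
+        while (reader.next(feature, patterns)) {
+          // each entry holds the top-k (itemset, support) pairs for one feature
+          for (Pair<List<String>, Long> p : patterns.getPatterns()) {
+            System.out.println(feature + "\t" + p.getFirst() + "\t" + p.getSecond());
+          }
+        }
+        reader.close();
+      }
+    }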
+ 
+Open questions: how does one experiment with and monitor these various
+parameters?

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md 
b/website-old/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
new file mode 100644
index 0000000..a3c7063
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/misc/perceptron-and-winnow.md
@@ -0,0 +1,41 @@
+---
+layout: default
+title: (Deprecated)  Perceptron and Winnow
+theme:
+    name: retro-mahout
+---
+<a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a>
+# Classification with Perceptron or Winnow
+
+Both algorithms are comparatively simple linear classifiers. Given training
+data in some n-dimensional vector space that is annotated with binary
+labels, the algorithms are guaranteed to find a linear separating hyperplane
+if one exists. In contrast to the Perceptron, Winnow works only for binary
+feature vectors.
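+
+For intuition, here is a minimal perceptron training loop (an illustrative
+sketch only, not the code in the patch); it assumes labels in {-1, +1}, dense
+double[] feature vectors, and omits a bias term:
+
+    public class PerceptronSketch {
+      /** x: feature vectors; y: labels in {-1, +1}. Returns the learned weights. */
+      public static double[] train(double[][] x, int[] y, int epochs, double learningRate) {
+        double[] w = new double[x[0].length];
+        for (int e = 0; e < epochs; e++) {
+          for (int i = 0; i < x.length; i++) {
+            double activation = 0.0;
+            for (int j = 0; j < w.length; j++) {
+              activation += w[j] * x[i][j];
+            }
+            if (y[i] * activation <= 0) {
+              // misclassified: move the hyperplane towards this example
+              for (int j = 0; j < w.length; j++) {
+                w[j] += learningRate * y[i] * x[i][j];
+              }
+            }
+          }
+        }
+        return w;
+      }
+    }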
+
+For more information on the Perceptron see for instance:
+http://en.wikipedia.org/wiki/Perceptron
+
+Concise course notes on both algorithms:
+http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture24.pdf
+
+Although the algorithms are comparatively simple, they still work pretty well
+for text classification and are fast to train even for huge example sets.
+In contrast to Naive Bayes, they are not based on the assumption that all
+features (in the domain of text classification: all terms in a document)
+are independent.
+
+<a name="PerceptronandWinnow-Strategyforparallelisation"></a>
+## Strategy for parallelisation
+
+Currently the strategy for parallelisation is simple: Given there is enough
+training data, split the training data. Train the classifier on each split.
+The resulting hyperplanes are then averaged.
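+
+Sketched in code (illustrative only; the class and method names are made up),
+the averaging step amounts to:
+
+    public class HyperplaneAveraging {
+      /** Averages weight vectors trained on separate splits (all of equal dimension). */
+      public static double[] average(double[][] perSplitWeights) {
+        double[] avg = new double[perSplitWeights[0].length];
+        for (double[] w : perSplitWeights) {
+          for (int i = 0; i < avg.length; i++) {
+            avg[i] += w[i] / perSplitWeights.length;
+          }
+        }
+        return avg;
+      }
+    }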
+
+<a name="PerceptronandWinnow-Roadmap"></a>
+## Roadmap
+
+Currently the patch only contains the code for the classifier itself. It is
+planned to provide unit tests and at least one example based on the WebKB
+dataset by the end of November for the serial version. After that the
+parallelisation will be added.

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/testing.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/misc/testing.md 
b/website-old/docs/tutorials/map-reduce/misc/testing.md
new file mode 100644
index 0000000..826fff8
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/misc/testing.md
@@ -0,0 +1,46 @@
+---
+layout: default
+title: (Deprecated)  Testing
+theme:
+    name: retro-mahout
+---
+<a name="Testing-Intro"></a>
+# Intro
+
+As Mahout matures, solid testing procedures are needed.  This page and its
+children capture test plans along with ideas for improving our testing.
+
+<a name="Testing-TestPlans"></a>
+# Test Plans
+
+* [0.6](0.6.html) - Test Plans for the 0.6 release.
+There are no special plans except for unit tests and user testing of the
+Hadoop jobs.
+
+<a name="Testing-TestIdeas"></a>
+# Test Ideas
+
+<a name="Testing-Regressions/Benchmarks/Integrations"></a>
+## Regressions/Benchmarks/Integrations
+* Algorithmic quality and speed are not tested, except in a few instances.
+Such tests often require much longer run times (minutes to hours), a
+running Hadoop cluster, and downloads of large datasets (in the megabytes). 
+* Standardized speed tests are difficult on different hardware. 
+* Unit tests of external integrations require access to externals: HDFS,
+S3, JDBC, Cassandra, etc. 
+
+Apache Jenkins is not able to support these environments. Commercial
+donations would help. 
+
+<a name="Testing-UnitTests"></a>
+## Unit Tests
+Mahout's current tests are almost entirely unit tests. Algorithm tests
+generally supply a few numbers to code paths and verify that the expected
+numbers come out. 'mvn test' runs these tests. There is "positive" coverage
+of a great many utilities and algorithms. A much smaller percentage includes
+"negative" coverage (bogus setups, inputs, combinations).
+
+<a name="Testing-Other"></a>
+## Other
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
 
b/website-old/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
new file mode 100644
index 0000000..4b6af12
--- /dev/null
+++ 
b/website-old/docs/tutorials/map-reduce/misc/using-mahout-with-python-via-jpype.md
@@ -0,0 +1,222 @@
+---
+layout: default
+title: (Deprecated)  Using Mahout with Python via JPype
+theme:
+    name: retro-mahout
+---
+
+<a name="UsingMahoutwithPythonviaJPype-overview"></a>
+# Using Mahout with Python via JPype - some examples
+This tutorial provides some sample code illustrating how we can read and
+write sequence files containing Mahout vectors from Python using JPype.
+This tutorial is intended for people who want to use Python for analyzing
+and plotting Mahout data. Using Mahout from Python turns out to be quite
+easy.
+
+This tutorial concerns the use of CPython (the standard Python interpreter)
+as opposed to Jython. Jython wasn't an option for me, because (to the best of
+my knowledge) it doesn't work with the Python extensions numpy, matplotlib,
+or h5py, which I rely on heavily.
+
+The instructions below explain how to set up a Python script to read and
+write the output of Mahout clustering.
+
+You will first need to download and install the JPype package for python.
+
+The first step to setting up JPype is determining the path to the dynamic
+library for the JVM; on Linux this will be a .so file and on Windows it
+will be a .dll.
+
+In your Python script, create a global variable with the path to this library.
+
+
+
+Next we need to figure out how to set the classpath for Mahout. The
+easiest way to do this is to edit the "bin/mahout" script to print out
+the classpath. Add the line "echo $CLASSPATH" to the script somewhere after
+the comment "run it" (this is line 195 or so). Execute the script to print
+out the classpath. Copy this output and paste it into a variable in your
+Python script. The result for me looks like the following:
+
+
+
+
+Now we can create a function to start the JVM in Python using JPype:
+
+    from jpype import *
+
+    jvm = None
+
+    def start_jpype():
+        # start the JVM once, pointing it at the jvmlib path and the Mahout
+        # classpath variables defined above
+        global jvm
+        if jvm is None:
+            cpopt = "-Djava.class.path={cp}".format(cp=classpath)
+            startJVM(jvmlib, "-ea", cpopt)
+            jvm = "started"
+
+
+
+<a 
name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a>
+# Writing Named Vectors to Sequence Files from Python
+We can now use JPype to create sequence files which will contain vectors to
+be used by Mahout for kmeans. The example below is a function which creates
+vectors from two Gaussian distributions with unit variance.
+
+
+    def create_inputs(ifile,*args,**param):
+     """Create a sequence file containing some normally distributed vectors.
+       ifile - path to the sequence file to create
+     """
+     
+     #matrix of the cluster means
+     cmeans=np.array([[1,1] ,[-1,-1]],np.int)
+     
+     nperc=30  #number of points per cluster
+     
+     vecs=[]
+     
+     vnames=[]
+     for cind in range(cmeans.shape[0]):
+      pts=np.random.randn(nperc,2)
+      pts=pts+cmeans[cind,:].reshape([1,cmeans.shape[1]])
+      vecs.append(pts)
+     
+      #names for the vectors
+      #names are just the points with an index
+      #we do this so we can validate by cross-referencing the name with the vector
+      vn=np.empty(nperc,dtype=(np.str,30))
+      for row in range(nperc):
+       vn[row]="c"+str(cind)+"_"+pts[row,0].astype((np.str,4))+"_"+pts[row,1].astype((np.str,4))
+      vnames.append(vn)
+      
+     vecs=np.vstack(vecs)
+     vnames=np.hstack(vnames)
+     
+    
+     #start the jvm
+     start_jpype()
+     
+     #create the sequence file that we will write to
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     DenseVectorCls=JPackage("org").apache.mahout.math.DenseVector
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     writer=io.SequenceFile.createWriter(fs, conf, path, io.Text, VectorWritableCls)
+     
+     
+     vecwritable=VectorWritableCls()
+     for row in range(vecs.shape[0]):
+      nvector=NamedVectorCls(DenseVectorCls(JArray(JDouble,1)(vecs[row,:])),vnames[row])
+      #need to wrap key and value because of overloading
+      wrapkey=JObject(io.Text("key "+str(row)),io.Writable)
+      wrapval=JObject(vecwritable,io.Writable)
+      
+      vecwritable.set(nvector)
+      writer.append(wrapkey,wrapval)
+      
+     writer.close()
+
+
+<a 
name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a>
+# Reading the KMeans Clustered Points from Python
+Similarly, we can use JPype to easily read the clustered points output by
+Mahout.
+
+    def read_clustered_pts(ifile,*args,**param):
+     """Read the clustered points
+     ifile - path to the sequence file containing the clustered points
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file that we will read from
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     
+     
+     ReaderCls=io.__getattribute__("SequenceFile$Reader") 
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=reader.getKeyClass()()
+     
+    
+     valcls=reader.getValueClass()
+     vecwritable=valcls()
+     while (reader.next(key,vecwritable)):     
+      weight=vecwritable.getWeight()
+      nvec=vecwritable.getVector()
+      
+      cname=nvec.__class__.__name__
+      if (cname.rsplit('.',1)[1]=="NamedVector"):  
+       print "cluster={key} Name={name} x={x} y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1))
+      else:
+       raise NotImplementedError("Vector isn't a NamedVector. Need to modify/test the code to handle this case.")
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a>
+# Reading the KMeans Centroids
+Finally we can create a function to print out the actual cluster centers
+found by Mahout:
+
+    def getClusters(ifile,*args,**param):
+     """Read the centroids from the clusters output by kmeans.
+          ifile - Path to the sequence file containing the centroids
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file that we will read from
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     ReaderCls=io.__getattribute__("SequenceFile$Reader")
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=io.Text()
+     
+    
+     valcls=reader.getValueClass()
+    
+     vecwritable=valcls()
+     
+     while (reader.next(key,vecwritable)):     
+      center=vecwritable.getCenter()
+      
+      print "id={cid} center={center}".format(cid=vecwritable.getId(),center=center.values)
+      pass
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/ec5eb314/website-old/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
----------------------------------------------------------------------
diff --git 
a/website-old/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md 
b/website-old/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
new file mode 100644
index 0000000..ad540c7
--- /dev/null
+++ b/website-old/docs/tutorials/map-reduce/recommender/intro-als-hadoop.md
@@ -0,0 +1,98 @@
+---
+layout: default
+title: (Deprecated)  Introduction to ALS Recommendations with Hadoop
+theme:
+    name: retro-mahout
+---
+
+# Introduction to ALS Recommendations with Hadoop
+
+##Overview
+
+Mahout’s ALS recommender is a matrix factorization algorithm that uses
+Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It
+factors the user-to-item matrix *A* into the user-to-feature matrix *U* and the
+item-to-feature matrix *M*, and it runs the ALS algorithm in a parallel fashion.
+The algorithm is described in detail in the following papers:
+
+* [Large-scale Parallel Collaborative Filtering for
+the Netflix 
Prize](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
+* [Collaborative Filtering for Implicit Feedback 
Datasets](http://research.yahoo.com/pub/2433) 
+
+This recommendation algorithm can be used in an e-commerce platform to recommend
+products to customers. Unlike the user- or item-based recommenders, which compute
+the similarity of users or items to make recommendations, the ALS algorithm
+uncovers the latent factors that explain the observed user-to-item ratings and
+tries to find optimal factor weights that minimize the least-squares error
+between predicted and actual ratings.
+
+Mahout's ALS recommendation algorithm takes as input user preferences by item
+and outputs recommended items for each user. The input preferences can be either
+explicit user ratings or implicit feedback such as a user's clicks on a web page.
+
+One of the strengths of the ALS-based recommender, compared to the user- or
+item-based recommenders, is its ability to handle large sparse data sets and its
+better prediction performance. It can also give an intuitive rationale for
+the factors that influence recommendations.
+
+##Implementation
+At present Mahout has a map-reduce implementation of ALS, which is composed of
+two jobs: a parallel matrix factorization job and a recommendation job.
+The matrix factorization job computes the user-to-feature matrix and
+item-to-feature matrix given the user-to-item ratings. Its input includes:
+<pre>
+    --input: directory containing files of explicit user to item rating or 
implicit feedback;
+    --output: output path of the user-feature matrix and feature-item matrix;
+    --lambda: regularization parameter to avoid overfitting;
+    --alpha: confidence parameter only used on implicit feedback
+    --implicitFeedback: boolean flag to indicate whether the input dataset 
contains implicit feedback;
+    --numFeatures: dimensions of feature space;
+    --numThreadsPerSolver: number of threads per solver mapper for concurrent 
execution;
+    --numIterations: number of iterations
+    --usesLongIDs: boolean flag to indicate whether the input contains long 
IDs that need to be translated
+</pre>
+and it outputs the matrices in sequence file format. 
+
+The recommendation job uses the user feature matrix and item feature matrix 
calculated from the factorization job to compute the top-N recommendations per 
user. Its input includes:
+<pre>
+    --input: directory containing files of user ids;
+    --output: output path of the recommended items for each input user id;
+    --userFeatures: path to the user feature matrix;
+    --itemFeatures: path to the item feature matrix;
+    --numRecommendations: maximum number of recommendations per user, default 
is 10;
+    --maxRating: maximum rating available;
+    --numThreads: number of threads per mapper;
+    --usesLongIDs: boolean flag to indicate whether the input contains long 
IDs that need to be translated;
+    --userIDIndex: index for user long IDs (necessary if usesLongIDs is true);
+    --itemIDIndex: index for item long IDs (necessary if usesLongIDs is true) 
+</pre>
+and it outputs a list of recommended item IDs for each user. The predicted
+rating between a user and an item is the dot product of the user's feature
+vector and the item's feature vector.
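+
+For illustration, the dot product can be computed directly with mahout-math
+vectors (a minimal sketch; the feature values below are made up and
+numFeatures is assumed to be 2):
+
+    import org.apache.mahout.math.DenseVector;
+    import org.apache.mahout.math.Vector;
+
+    public class PredictRating {
+      public static void main(String[] args) {
+        // hypothetical rows taken from U (user features) and M (item features)
+        Vector userFeatures = new DenseVector(new double[] {0.8, 1.3});
+        Vector itemFeatures = new DenseVector(new double[] {1.1, 0.4});
+        double predictedRating = userFeatures.dot(itemFeatures);
+        System.out.println("predicted rating = " + predictedRating);
+      }
+    }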
+
+##Example
+
+Let’s look at a simple example of how we could use Mahout’s ALS 
recommender to recommend items for users. First, you’ll need to get Mahout up 
and running, the instructions for which can be found 
[here](https://mahout.apache.org/users/basics/quickstart.html). After you've 
ensured Mahout is properly installed, we’re ready to run the example.
+
+**Step 1: Prepare test data**
+
+Similar to Mahout's item-based recommender, the ALS recommender relies on
+user-to-item preference data: *userID*, *itemID* and *preference*. The
+preference can be an explicit numeric rating or a count of actions such as
+clicks (implicit feedback). Each line of the test data file is a tab-delimited
+string: the first field is the user ID (numeric), the second field is the item
+ID (numeric), and the third field is the preference, which should also be a
+number.
+
+**Note:** You must create IDs that are ordinal positive integers for all user
+and item IDs. Often this will require you to keep a dictionary to map into and
+out of Mahout IDs. For instance, if the first user has ID "xyz" in your
+application, it would get a Mahout ID of the integer 1, and so on; the same
+goes for item IDs. Then, after recommendations are calculated, you will have to
+translate the Mahout user and item IDs back into your application IDs.
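+
+A minimal sketch of such a dictionary (the class and method names here are
+hypothetical, not part of Mahout):
+
+    import java.util.HashMap;
+    import java.util.Map;
+
+    public class IdDictionary {
+      private final Map<String, Integer> toMahout = new HashMap<String, Integer>();
+      private final Map<Integer, String> fromMahout = new HashMap<Integer, String>();
+      private int nextId = 1; // Mahout IDs must be ordinal positive integers
+
+      public int toMahoutId(String appId) {
+        Integer id = toMahout.get(appId);
+        if (id == null) {
+          id = nextId++;
+          toMahout.put(appId, id);
+          fromMahout.put(id, appId);
+        }
+        return id;
+      }
+
+      public String toAppId(int mahoutId) {
+        return fromMahout.get(mahoutId);
+      }
+    }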
+
+To quickly start, you could specify a text file like the following as the input:
+<pre>
+1      100     1
+1      200     5
+1      400     1
+2      200     2
+2      300     1
+</pre>
+
+**Step 2: Determine parameters**
+
+In addition, users need to determine the dimension of the feature space and the
+number of iterations to run the alternating least squares algorithm. Using 10
+features and 15 iterations is a reasonable default to try first. Optionally, a
+confidence parameter can be set if the input preference is implicit user
+feedback.
+
+**Step 3: Run ALS**
+
+Assuming your *JAVA_HOME* is appropriately set and Mahout was installed
+properly, we’re ready to run the factorization job. Enter the following command:
+
+    $ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 \
+        --implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5 \
+        --numThreadsPerSolver 1 --tempDir tmp
+
+Running the command will execute a series of jobs, the final product of which
+will be an output file deposited to the output directory specified in the
+command. The output directory contains three sub-directories: *M* stores the
+item-to-feature matrix, *U* stores the user-to-feature matrix and *userRatings*
+stores the users' ratings on the items. The *tempDir* parameter specifies the
+directory to store the intermediate output of the job, such as the matrix
+output in each iteration and each item's average rating. Using the *tempDir*
+will help with debugging.
+
+**Step 4: Make Recommendations**
+
+Based on the output feature matrices from step 3, we could make 
recommendations for users. Enter the following command:
+
+     $ mahout recommendfactorized --input $als_recommender_input \
+         --userFeatures $als_output/U/ --itemFeatures $als_output/M/ \
+         --numRecommendations 1 --output recommendations --maxRating 1
+
+The input user file is a sequence file; the record key is the user ID and the
+value is the user's rated item IDs, which will be excluded from the
+recommendations. The output file generated in our simple example will be a text
+file giving the recommended item IDs for each user.
+Remember to translate the Mahout IDs back into your application-specific IDs.
+
+There are a variety of parameters for Mahout’s ALS recommender to
+accommodate custom business requirements; exploring and testing various
+configurations to suit your needs will doubtless lead to additional questions.
+Feel free to ask such questions on the [mailing
+list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html).
+
