http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/mlp.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/mlp.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/mlp.md new file mode 100644 index 0000000..a98f033 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/mlp.md @@ -0,0 +1,172 @@ +--- +layout: default +title: Multilayer Perceptron +theme: + name: retro-mahout +--- + +Multilayer Perceptron +===================== + +A multilayer perceptron is a biologically inspired feed-forward network that can +be trained to represent a nonlinear mapping between input and output data. It +consists of multiple layers, each containing multiple artificial neuron units and +can be used for classification and regression tasks in a supervised learning approach. + +Command line usage +------------------ + +The MLP implementation is currently located in the MapReduce-Legacy package. It +can be used with the following commands: + + +# model training + $ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron +# model usage + $ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron + + +To train and use the model, a number of parameters can be specified. Parameters without default values have to be specified by the user. Consider that not all parameters can be used both for training and running the model. We give an example of the usage below. + +### Parameters + +| Command | Default | Description | Type | +|:---------|---------:|:-------------|:---------| +| --input -i | | Path to the input data (currently, only .csv-files are allowed) | | +| --skipHeader -sh | false | Skip first row of the input file (corresponds to the csv headers)| | +|--update -u | false | Whether the model should be updated incrementally with every new training instance. If this parameter is not given, the model is trained from scratch. | training | +| --labels -labels | | Instance labels separated by whitespaces. | training | +| --model -mo | | Location where the model will be stored / is stored (if the specified location has an existing model, it will update the model through incremental learning). | | +| --layerSize -ls | | Number of units per layer, including input, hidden and ouput layers. This parameter specifies the topology of the network (see [this image][mlp] for an example specified by `-ls 4 8 3`). | training | +| --squashingFunction -sf| Sigmoid | The squashing function to use for the units. Currently only the sigmoid fucntion is available. | training | +| --learningRate -l | 0.5 | The learning rate that is used for weight updates. | training | +| --momemtumWeight -m | 0.1 | The momentum weight that is used for gradient descent. Must be in the range between 0 ... 1.0 | training | +| --regularizationWeight -r | 0 | Regularization value for the weight vector. Must be in the range between 0 ... 0.1 | training | +| --format -f | csv | Input file format. Currently only csv is supported. | | +|--columnRange -cr | | Range of the columns to use from the input file, starting with 0 (i.e. `-cr 0 5` for including the first six columns only) | testing | +| --output -o | | Path to store the labeled results from running the model. | testing | + +Example usage +------------- + +In this example, we will train a multilayer perceptron for classification on the iris data set. The iris flower data set contains data of three flower species where each datapoint consists of four features. +The dimensions of the data set are given through some flower parameters (sepal length, sepal width, ...). All samples contain a label that indicates the flower species they belong to. + +### Training + +To train our multilayer perceptron model from the command line, we call the following command + + + $ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron \ + -i ./mrlegacy/src/test/resources/iris.csv -sh \ + -labels setosa versicolor virginica \ + -mo /tmp/model.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001 + + +The individual parameters are explained in the following. + +- `-i ./mrlegacy/src/test/resources/iris.csv` use the iris data set as input data +- `-sh` since the file `iris.csv` contains a header row, this row needs to be skipped +- `-labels setosa versicolor virginica` we specify, which class labels should be learnt (which are the flower species in this case) +- `-mo /tmp/model.model` specify where to store the model file +- `-ls 4 8 3` we specify the structure and depth of our layers. The actual network structure can be seen in the figure below. +- `-l 0.2` we set the learning rate to `0.2` +- `-m 0.35` momemtum weight is set to `0.35` +- `-r 0.0001` regularization weight is set to `0.0001` + +| | | +|---|---| +| The picture shows the architecture defined by the above command. The topolgy of the network is completely defined through the number of layers and units because in this implementation of the MLP every unit is fully connected to the units of the next and previous layer. Bias units are added automatically. | ![Multilayer perceptron network][mlp] | + +[mlp]: mlperceptron_structure.png "Architecture of a three-layer MLP" +### Testing + +To test / run the multilayer perceptron classification on the trained model, we can use the following command + + + $ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron \ + -i ./mrlegacy/src/test/resources/iris.csv -sh -cr 0 3 \ + -mo /tmp/model.model -o /tmp/labelResult.txt + + +The individual parameters are explained in the following. + +- `-i ./mrlegacy/src/test/resources/iris.csv` use the iris data set as input data +- `-sh` since the file `iris.csv` contains a header row, this row needs to be skipped +- `-cr 0 3` we specify the column range of the input file +- `-mo /tmp/model.model` specify where the model file is stored +- `-o /tmp/labelResult.txt` specify where the labeled output file will be stored + +Implementation +-------------- + +The Multilayer Perceptron implementation is based on a more general Neural Network class. Command line support was added later on and provides a simple usage of the MLP as shown in the example. It is implemented to run on a single machine using stochastic gradient descent where the weights are updated using one datapoint at a time, resulting in a weight update of the form: +$$ \vec{w}^{(t + 1)} = \vec{w}^{(t)} - n \Delta E_n(\vec{w}^{(t)}) $$ + +where *a* is the activation of the unit. It is not yet possible to change the learning to more advanced methods using adaptive learning rates yet. + +The number of layers and units per layer can be specified manually and determines the whole topology with each unit being fully connected to the previous layer. A bias unit is automatically added to the input of every layer. +Currently, the logistic sigmoid is used as a squashing function in every hidden and output layer. It is of the form: + +$$ \frac{1}{1 + exp(-a)} $$ + +The command line version **does not perform iterations** which leads to bad results on small datasets. Another restriction is, that the CLI version of the MLP only supports classification, since the labels have to be given explicitly when executing on the command line. + +A learned model can be stored and updated with new training instanced using the `--update` flag. Output of classification reults is saved as a .txt-file and only consists of the assigned labels. Apart from the command-line interface, it is possible to construct and compile more specialized neural networks using the API and interfaces in the mrlegacy package. + + +Theoretical Background +------------------------- + +The *multilayer perceptron* was inspired by the biological structure of the brain where multiple neurons are connected and form columns and layers. Perceptual input enters this network through our sensory organs and is then further processed into higher levels. +The term multilayer perceptron is a little misleading since the *perceptron* is a special case of a single *artificial neuron* that can be used for simple computations [\[1\]][1]. The difference is that the perceptron uses a discontinous nonlinearity while for the MLP neurons that are implemented in mahout it is important to use continous nonlinearities. This is necessary for the implemented learning algorithm, where the error is propagated back from the output layer to the input layer and the weights of the connections are changed according to their contribution to the overall error. This algorithm is called backpropagation and uses gradient descent to update the weights. To compute the gradients we need continous nonlinearities. But let's start from the beginning! + +The first layer of the MLP represents the input and has no other purpose than routing the input to every connected unit in a feed-forward fashion. Following layers are called hidden layers and the last layer serves the special purpose to determine the output. The activation of a unit *u* in a hidden layer is computed through a weighted sum of all inputs, resulting in +$$ a_j = \sum_{i=1}^{D} w_{ji}^{(l)} x_i + w_{j0}^{(l)} $$ +This computes the activation *a* for neuron *j* where *w* is the weight from neuron *i* to neuron *j* in layer *l*. The last part, where *i = 0* is called the bias and can be used as an offset, independent from the input. + +The activation is then transformed by the aforementioned differentiable, nonlinear *activation function* and serves as the input to the next layer. The activation function is usually chosen from the family of sigmoidal functions such as *tanh* or *logistic sigmoidal* [\[2\]][2]. Often sigmoidal and logistic sigmoidal are used synonymous. Another word for the activation function is *squashing function* since the s-shape of this function class *squashes* the input. + +For different units or layers, different activation functions can be used to obtain different behaviors. Especially in the output layer, the activation function can be chosen to obtain the output value *y*, depending on the learning problem: +$$ y_k = \sigma (a_k) $$ + +If the learning problem is a linear regression task, sigma can be chosen to be the identity function. In case of classification problems, the choice of the squashing functions depends on the exact task at hand and often softmax activation functions are used. + +The equation for a MLP with three layers (one input, one hidden and one output) is then given by + +$$ y_k(\vec{x}, \vec{w}) = h \left( \sum_{j=1}^{M} w_{kj}^{(2)} h \left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right) $$ + +where *h* indicates the respective squashing function that is used in the units of a layer. *M* and *D* specify the number of incoming connections to a unit and we can see that the input to the first layer (hidden layer) is just the original input *x* whereas the input into the second layer (output layer) is the transformed output of layer one. The output *y* of unit *k* is therefore given by the above equation and depends on the input *x* and the weight vector *w*. This shows us, that the parameter that we can optimize during learning is *w* since we can not do anything about the input *x*. To facilitate the following steps, we can include the bias-terms into the weight vector and correct for the indices by adding another dimension with the value 1 to the input vector. The bias is a constant factor that is added to the weighted sum and that serves as a scaling factor of the nonlinear transformation. Including it into the weight vector leads to: + +$$ y_k(\vec{x}, \vec{w}) = h \left( \sum_{j=0}^{M} w_{kj}^{(2)} h \left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right) $$ + +The previous paragraphs described how the MLP transforms a given input into some output using a combination of different nonlinear functions. Of course what we really want is to learn the structure of our data so that we can feed data with unknown labels into the network and get the estimated target labels *t*. To achieve this, we have to train our network. In this context, training means optimizing some function such that the error between the real labels *y* and the network-output *t* becomes smallest. We have seen in the previous pragraph, that our only knob to change is the weight vector *w*, making the function to be optimized a function of *w*. For simplicitly and because it is widely used, we choose the so called *sum-of-squares* error function as an example that is given by + +$$ E(\vec{w}) = \frac{1}{2} \sum_{n=1}^N \left( y(\vec{x}_n, \vec{w}) - t_n \right)^2 $$ + +The goal is to minimize this function and thereby increase the performance of our model. A common method to achieve this is to use gradient descent and the so called technique of *backpropagation* where the goal is to compute the contribution of every unit to the overall error and changing the weight according to this contribution and into the direction of the gradient of the error function at this particular unit. In the following we try to give a short overview of the model training with gradient descent and backpropagation. A more detailed example can be found in [\[3\]][3] where much of this information is taken from. + +The problem with minimizing the error function is that the error can only be computed at the output layers where we get *t*, but we want to update all the weights of all the units. Therefore we use the technique of backpropagation to propagate the error, that we first compute at the output layer, back to the units of the previous layers. For this approach we also need to compute the gradients of the activation function. + +Weights are then updated with a small step in the direction of the negative gradient, regulated by the learning rate *n* such that we arrive at the formula for weight update: + +$$ \vec{w}^{(t + 1)} = \vec{w}^{(t)} - n \Delta E(\vec{w}^{(t)}) $$ + +A momentum weight can be set as a parameter of the gradient descent method to increase the probability of finding better local or global optima of the error function. + + + + + +[1]: http://en.wikipedia.org/wiki/Perceptron "The perceptron in wikipedia" +[2]: http://en.wikipedia.org/wiki/Sigmoid_function "Sigmoid function on wikipedia" +[3]: http://research.microsoft.com/en-us/um/people/cmbishop/prml/ "Christopher M. Bishop: Pattern Recognition and Machine Learning, Springer 2009" + +References + +\[1\] http://en.wikipedia.org/wiki/Perceptron + +\[2\] http://en.wikipedia.org/wiki/Sigmoid_function + +\[3\] [Christopher M. Bishop: Pattern Recognition and Machine Learning, Springer 2009](http://research.microsoft.com/en-us/um/people/cmbishop/prml/) +
http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/naivebayes.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/naivebayes.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/naivebayes.md new file mode 100644 index 0000000..a697653 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/naivebayes.md @@ -0,0 +1,45 @@ +--- +layout: default +title: NaiveBayes +theme: + name: retro-mahout +--- + +<a name="NaiveBayes-NaiveBayes"></a> +# Naive Bayes + +Naive Bayes is an algorithm that can be used to classify objects into +usually binary categories. It is one of the most common learning algorithms +in spam filters. Despite its simplicity and rather naive assumptions it has +proven to work surprisingly well in practice. + +Before applying the algorithm, the objects to be classified need to be +represented by numerical features. In the case of e-mail spam each feature +might indicate whether some specific word is present or absent in the mail +to classify. The algorithm comes in two phases: Learning and application. +During learning, a set of feature vectors is given to the algorithm, each +vector labeled with the class the object it represents, belongs to. From +that it is deduced which combination of features appears with high +probability in spam messages. Given this information, during application +one can easily compute the probability of a new message being either spam +or not. + +The algorithm does make several assumptions, that are not true for most +datasets, but make computations easier. The worst probably being, that all +features of an objects are considered independent. In practice, that means, +given the phrase "Statue of Liberty" was already found in a text, does not +influence the probability of seeing the phrase "New York" as well. + +<a name="NaiveBayes-StrategyforaparallelNaiveBayes"></a> +## Strategy for a parallel Naive Bayes + +See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9) +. + + +<a name="NaiveBayes-Examples"></a> +## Examples + +[20Newsgroups](20newsgroups.html) + - Example code showing how to train and use the Naive Bayes classifier +using the 20 Newsgroups data available at [http://people.csail.mit.edu/jrennie/20Newsgroups/] http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/neural-network.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/neural-network.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/neural-network.md new file mode 100644 index 0000000..7180656 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/neural-network.md @@ -0,0 +1,22 @@ +--- +layout: default +title: Neural Network +theme: + name: retro-mahout +--- + +<a name="NeuralNetwork-NeuralNetworks"></a> +# Neural Networks + +Neural Networks are a means for classifying multi dimensional objects. We +concentrate on implementing back propagation networks with one hidden layer +as these networks have been covered by the [2006 NIPS map reduce paper](http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf) +. Those networks are capable of learning not only linear separating hyper +planes but arbitrary decision boundaries. + +<a name="NeuralNetwork-Strategyforparallelbackpropagationnetwork"></a> +## Strategy for parallel backpropagation network + + +<a name="NeuralNetwork-Designofimplementation"></a> +## Design of implementation http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/partial-implementation.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/partial-implementation.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/partial-implementation.md new file mode 100644 index 0000000..2a20ccb --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/partial-implementation.md @@ -0,0 +1,146 @@ +--- +layout: default +title: Partial Implementation +theme: + name: retro-mahout +--- + + +# Classifying with random forests + +<a name="PartialImplementation-Introduction"></a> +# Introduction + +This quick start page shows how to build a decision forest using the +partial implementation. This tutorial also explains how to use the decision +forest to classify new data. +Partial Decision Forests is a mapreduce implementation where each mapper +builds a subset of the forest using only the data available in its +partition. This allows building forests using large datasets as long as +each partition can be loaded in-memory. + +<a name="PartialImplementation-Steps"></a> +# Steps +<a name="PartialImplementation-Downloadthedata"></a> +## Download the data +* The current implementation is compatible with the UCI repository file +format. In this example we'll use the NSL-KDD dataset because its large +enough to show the performances of the partial implementation. +You can download the dataset here http://nsl.cs.unb.ca/NSL-KDD/ +You can either download the full training set "KDDTrain+.ARFF", or a 20% +subset "KDDTrain+_20Percent.ARFF" (we'll use the full dataset in this +tutorial) and the test set "KDDTest+.ARFF". +* Open the train and test files and remove all the lines that begin with +'@'. All those lines are at the top of the files. Actually you can keep +those lines somewhere, because they'll help us describe the dataset to +Mahout +* Put the data in HDFS: {code} +$HADOOP_HOME/bin/hadoop fs -mkdir testdata +$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata{code} + +<a name="PartialImplementation-BuildtheJobfiles"></a> +## Build the Job files +* In $MAHOUT_HOME/ run: {code}mvn clean install -DskipTests{code} + +<a name="PartialImplementation-Generateafiledescriptorforthedataset:"></a> +## Generate a file descriptor for the dataset: +run the following command: + + $HADOOP_HOME/bin/hadoop jar +$MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar +org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff +-f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L + +The "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" string describes all the attributes +of the data. In this cases, it means 1 numerical(N) attribute, followed by +3 Categorical(C) attributes, ...L indicates the label. You can also use 'I' +to ignore some attributes + +<a name="PartialImplementation-Runtheexample"></a> +## Run the example + + + $HADOOP_HOME/bin/hadoop jar +$MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar +org.apache.mahout.classifier.df.mapreduce.BuildForest +-Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff -ds +testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest + +which builds 100 trees (-t argument) using the partial implementation (-p). +Each tree is built using 5 random selected attribute per node (-sl +argument) and the example outputs the decision tree in the "nsl-forest" +directory (-o). +The number of partitions is controlled by the -Dmapred.max.split.size +argument that indicates to Hadoop the max. size of each partition, in this +case 1/10 of the size of the dataset. Thus 10 partitions will be used. +IMPORTANT: using less partitions should give better classification results, +but needs a lot of memory. So if the Jobs are failing, try increasing the +number of partitions. +* The example outputs the Build Time and the oob error estimation + + + 10/03/13 17:57:29 INFO mapreduce.BuildForest: Build Time: 0h 7m 43s 582 + 10/03/13 17:57:33 INFO mapreduce.BuildForest: oob error estimate : +0.002325895231517865 + 10/03/13 17:57:33 INFO mapreduce.BuildForest: Storing the forest in: +nsl-forest/forest.seq + + +<a name="PartialImplementation-UsingtheDecisionForesttoClassifynewdata"></a> +## Using the Decision Forest to Classify new data +run the following command: + + $HADOOP_HOME/bin/hadoop jar +$MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar +org.apache.mahout.classifier.df.mapreduce.TestForest -i +nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info -m nsl-forest -a -mr -o +predictions + +This will compute the predictions of "KDDTest+.arff" dataset (-i argument) +using the same data descriptor generated for the training dataset (-ds) and +the decision forest built previously (-m). Optionally (if the test dataset +contains the labels of the tuples) run the analyzer to compute the +confusion matrix (-a), and you can also store the predictions in a text +file or a directory of text files(-o). Passing the (-mr) parameter will use +Hadoop to distribute the classification. + +* The example should output the classification time and the confusion +matrix + + + 10/03/13 18:08:56 INFO mapreduce.TestForest: Classification Time: 0h 0m 6s +355 + 10/03/13 18:08:56 INFO mapreduce.TestForest: +======================================================= + Summary + ------------------------------------------------------- + Correctly Classified Instances : 17657 78.3224% + Incorrectly Classified Instances : 4887 21.6776% + Total Classified Instances : 22544 + + ======================================================= + Confusion Matrix + ------------------------------------------------------- + a b <--Classified as + 9459 252 | 9711 a = normal + 4635 8198 | 12833 b = anomaly + Default Category: unknown: 2 + + +If the input is a single file then the output will be a single text file, +in the above example 'predictions' would be one single file. If the input +if a directory containing for example two files 'a.data' and 'b.data', then +the output will be a directory 'predictions' containing two files +'a.data.out' and 'b.data.out' + +<a name="PartialImplementation-KnownIssuesandlimitations"></a> +## Known Issues and limitations +The "Decision Forest" code is still "a work in progress", many features are +still missing. Here is a list of some known issues: +* For now, the training does not support multiple input files. The input +dataset must be one single file (this support will be available with the upcoming release). +Classifying new data does support multiple +input files. +* The tree building is done when each mapper.close() method is called. +Because the mappers don't refresh their state, the job can fail when the +dataset is big and you try to build a large number of trees. http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/random-forests.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/random-forests.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/random-forests.md new file mode 100644 index 0000000..c8b1a47 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/random-forests.md @@ -0,0 +1,234 @@ +--- +layout: default +title: Random Forests +theme: + name: retro-mahout +--- + +<a name="RandomForests-HowtogrowaDecisionTree"></a> +### How to grow a Decision Tree + +source : \[3\](3\.html) + +LearnUnprunedTree(*X*,*Y*) + +Input: *X* a matrix of *R* rows and *M* columns where *X{*}{*}{~}ij{~}* = +the value of the *j*'th attribute in the *i*'th input datapoint. Each +column consists of either all real values or all categorical values. +Input: *Y* a vector of *R* elements, where *Y{*}{*}{~}i{~}* = the output +class of the *i*'th datapoint. The *Y{*}{*}{~}i{~}* values are categorical. +Output: An Unpruned decision tree + + +If all records in *X* have identical values in all their attributes (this +includes the case where *R<2*), return a Leaf Node predicting the majority +output, breaking ties randomly. This case also includes +If all values in *Y* are the same, return a Leaf Node predicting this value +as the output +Else + select *m* variables at random out of the *M* variables + For *j* = 1 .. *m* + If *j*'th attribute is +categorical +* +IG{*}{*}{~}j{~}* = IG(*Y*\|*X{*}{*}{~}j{~}*) (see Information +Gain) + Else (*j*'th attribute is +real-valued) +* +IG{*}{*}{~}j{~}* = IG*(*Y*\|*X{*}{*}{~}j{~}*) (see Information Gain) + Let *j\** = argmax{~}j~ *IG{*}{*}{~}j{~}* (this is the +splitting attribute we'll use) + If *j\** is categorical then + For each value *v* of the *j*'th +attribute + Let +*X{*}{*}{^}v{^}* = subset of rows of *X* in which *X{*}{*}{~}ij{~}* = *v*. +Let *Y{*}{*}{^}v{^}* = corresponding subset of *Y* + Let *Child{*}{*}{^}v{^}* = +LearnUnprunedTree(*X{*}{*}{^}v{^}*,*Y{*}{*}{^}v{^}*) + Return a decision tree node, +splitting on *j*'th attribute. The number of children equals the number of +values of the *j*'th attribute, and the *v*'th child is +*Child{*}{*}{^}v{^}* + Else *j\** is real-valued and let *t* be the best split +threshold + Let *X{*}{*}{^}LO{^}* = subset +of rows of *X* in which *X{*}{*}{~}ij{~}* *<= t*. Let *Y{*}{*}{^}LO{^}* = +corresponding subset of *Y* + Let *Child{*}{*}{^}LO{^}* = +LearnUnprunedTree(*X{*}{*}{^}LO{^}*,*Y{*}{*}{^}LO{^}*) + Let *X{*}{*}{^}HI{^}* = subset of rows of *X* +in which *X{*}{*}{~}ij{~}* *> t*. Let *Y{*}{*}{^}HI{^}* = corresponding +subset of *Y* + Let *Child{*}{*}{^}HI{^}* = +LearnUnprunedTree(*X{*}{*}{^}HI{^}*,*Y{*}{*}{^}HI{^}*) + Return a decision tree node, splitting on +*j*'th attribute. It has two children corresponding to whether the *j*'th +attribute is above or below the given threshold. + +*Note*: There are alternatives to Information Gain for splitting nodes + + +<a name="RandomForests-Informationgain"></a> +### Information gain + +source : \[3\](3\.html) +1. h4. nominal attributes + +suppose X can have one of m values V{~}1~,V{~}2~,...,V{~}m~ +P(X=V{~}1~)=p{~}1~, P(X=V{~}2~)=p{~}2~,...,P(X=V{~}m~)=p{~}m~ + +H(X)= \-sum{~}j=1{~}{^}m^ p{~}j~ log{~}2~ p{~}j~ (The entropy of X) +H(Y\|X=v) = the entropy of Y among only those records in which X has value +v +H(Y\|X) = sum{~}j~ p{~}j~ H(Y\|X=v{~}j~) +IG(Y\|X) = H(Y) - H(Y\|X) +1. h4. real-valued attributes + +suppose X is real valued +define IG(Y\|X:t) as H(Y) - H(Y\|X:t) +define H(Y\|X:t) = H(Y\|X<t) P(X<t) + H(Y\|X>=t) P(X>=t) +define IG*(Y\|X) = max{~}t~ IG(Y\|X:t) + +<a name="RandomForests-HowtogrowaRandomForest"></a> +### How to grow a Random Forest + +source : \[1\](1\.html) + +Each tree is grown as follows: +1. if the number of cases in the training set is *N*, sample *N* cases at +random \-but with replacement, from the original data. This sample will be +the training set for the growing tree. +1. if there are *M* input variables, a number *m << M* is specified such +that at each node, *m* variables are selected at random out of the *M* and +the best split on these *m* is used to split the node. The value of *m* is +held constant during the forest growing. +1. each tree is grown to its large extent possible. There is no pruning. + +<a name="RandomForests-RandomForestparameters"></a> +### Random Forest parameters + +source : \[2\](2\.html) +Random Forests are easy to use, the only 2 parameters a user of the +technique has to determine are the number of trees to be used and the +number of variables (*m*) to be randomly selected from the available set of +variables. +Breinman's recommendations are to pick a large number of trees, as well as +the square root of the number of variables for *m*. + + +<a name="RandomForests-Howtopredictthelabelofacase"></a> +### How to predict the label of a case + +Classify(*node*,*V*) + Input: *node* from the decision tree, if *node.attribute += j* then the split is done on the *j*'th attribute + + Input: *V* a vector of *M* columns where +*V{*}{*}{~}j{~}* = the value of the *j*'th attribute. + Output: label of *V* + + If *node* is a Leaf then + Return the value predicted +by *node* + + Else + Let *j = +node.attribute* + If *j* is +categorical then + + +Let *v* = *V{*}{*}{~}j{~}* + + +Let *child{*}{*}{^}v{^}* = child node corresponding to the attribute's +value *v* + + Return Classify(*child{*}{*}{^}v{^}*,*V*) + + Else *j* is +real-valued + + +Let *t = node.threshold* (split threshold) + + If Vj < t then + + Let *child{*}{*}{^}LO{^}* = child +node corresponding to (*<t*) + + Return +Classify(*child{*}{*}{^}LO{^}*,*V*) + + +Else + + Let *child{*}{*}{^}HI{^}* = +child node corresponding to (*>=t*) + + Return +Classify(*child{*}{*}{^}HI{^}*,*V*) + + +<a name="RandomForests-Theoutofbag(oob)errorestimation"></a> +### The out of bag (oob) error estimation + +source : \[1\](1\.html) + +in random forests, there is no need for cross-validation or a separate test +set to get an unbiased estimate of the test set error. It is estimated +internally, during the run, as follows: +* each tree is constructed using a different bootstrap sample from the +original data. About one-third of the cases left of the bootstrap sample +and not used in the construction of the _kth_ tree. +* put each case left out in the construction of the _kth_ tree down the +_kth{_}tree to get a classification. In this way, a test set classification +is obtained for each case in about one-thrid of the trees. At the end of +the run, take *j* to be the class that got most of the the votes every time +case *n* was _oob_. The proportion of times that *j* is not equal to the +true class of *n* averaged over all cases is the _oob error estimate_. This +has proven to be unbiased in many tests. + +<a name="RandomForests-OtherRFuses"></a> +### Other RF uses + +source : \[1\](1\.html) +* variable importance +* gini importance +* proximities +* scaling +* prototypes +* missing values replacement for the training set +* missing values replacement for the test set +* detecting mislabeled cases +* detecting outliers +* detecting novelties +* unsupervised learning +* balancing prediction error +Please refer to \[1\](1\.html) + for a detailed description + +<a name="RandomForests-References"></a> +### References + +\[1\](1\.html) + Random Forests - Classification Description + [http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm) +\[2\](2\.html) + B. Larivi�re & D. Van Den Poel, 2004. "Predicting Customer Retention +and Profitability by Using Random Forests and Regression Forests +Techniques," + Working Papers of Faculty of +Economics and Business Administration, Ghent University, Belgium 04/282, +Ghent University, + Faculty of Economics and +Business Administration. + Available online : [http://ideas.repec.org/p/rug/rugwps/04-282.html](http://ideas.repec.org/p/rug/rugwps/04-282.html) +\[3\](3\.html) + Decision Trees - Andrew W. Moore\[4\] + http://www.cs.cmu.edu/~awm/tutorials\[1\](1\.html) +\[4\](4\.html) + Information Gain - Andrew W. Moore + [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials) http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/restricted-boltzmann-machines.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/restricted-boltzmann-machines.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/restricted-boltzmann-machines.md new file mode 100644 index 0000000..0aa8641 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/restricted-boltzmann-machines.md @@ -0,0 +1,49 @@ +--- +layout: default +title: Restricted Boltzmann Machines +theme: + name: retro-mahout +--- + +NOTE: This implementation is a Work-In-Progress, at least till September, +2010. + +The JIRA issue is [here](https://issues.apache.org/jira/browse/MAHOUT-375) +. + +<a name="RestrictedBoltzmannMachines-BoltzmannMachines"></a> +### Boltzmann Machines +Boltzmann Machines are a type of stochastic neural networks that closely +resemble physical processes. They define a network of units with an overall +energy that is evolved over a period of time, until it reaches thermal +equilibrium. + +However, the convergence speed of Boltzmann machines that have +unconstrained connectivity is low. + +<a name="RestrictedBoltzmannMachines-RestrictedBoltzmannMachines"></a> +### Restricted Boltzmann Machines +Restricted Boltzmann Machines are a variant, that are 'restricted' in the +sense that connections between hidden units of a single layer are _not_ +allowed. In addition, stacking multiple RBM's is also feasible, with the +activities of the hidden units forming the base for a higher-level RBM. The +combination of these two features renders RBM's highly usable for +parallelization. + +In the Netflix Prize, RBM's offered distinctly orthogonal predictions to +SVD and k-NN approaches, and contributed immensely to the final solution. + +<a name="RestrictedBoltzmannMachines-RBM'sinApacheMahout"></a> +### RBM's in Apache Mahout +An implementation of Restricted Boltzmann Machines is being developed for +Apache Mahout as a Google Summer of Code 2010 project. A recommender +interface will also be provided. The key aims of the implementation are: +1. Accurate - should replicate known results, including those of the Netflix +Prize +1. Fast - The implementation uses Map-Reduce, hence, it should be fast +1. Scale - Should scale to large datasets, with a design whose critical +parts don't need a dependency between the amount of memory on your cluster +systems and the size of your dataset + +You can view the patch as it develops [here](http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm) +. http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/support-vector-machines.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/support-vector-machines.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/support-vector-machines.md new file mode 100644 index 0000000..6d1b9df --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/support-vector-machines.md @@ -0,0 +1,43 @@ +--- +layout: default +title: Support Vector Machines +theme: + name: retro-mahout +--- + +<a name="SupportVectorMachines-SupportVectorMachines"></a> +# Support Vector Machines + +As with Naive Bayes, Support Vector Machines (or SVMs in short) can be used +to solve the task of assigning objects to classes. However, the way this +task is solved is completely different to the setting in Naive Bayes. + +Each object is considered to be a point in _n_ dimensional feature space, +_n_ being the number of features used to describe the objects numerically. +In addition each object is assigned a binary label, let us assume the +labels are "positive" and "negative". During learning, the algorithm tries +to find a hyperplane in that space, that perfectly separates positive from +negative objects. +It is trivial to think of settings where this might very well be +impossible. To remedy this situation, objects can be assigned so called +slack terms, that punish mistakes made during learning appropriately. That +way, the algorithm is forced to find the hyperplane that causes the least +number of mistakes. + +Another way to overcome the problem of there being no linear hyperplane to +separate positive from negative objects is to simply project each feature +vector into an higher dimensional feature space and search for a linear +separating hyperplane in that new space. Usually the main problem with +learning in high dimensional feature spaces is the so called curse of +dimensionality. That is, there are fewer learning examples available than +free parameters to tune. In the case of SVMs this problem is less +detrimental, as SVMs impose additional structural constraints on their +solutions. Each separating hyperplane needs to have a maximal margin to all +training examples. In addition, that way, the solution may be based on the +information encoded in only very few examples. + +<a name="SupportVectorMachines-Strategyforparallelization"></a> +## Strategy for parallelization + +<a name="SupportVectorMachines-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/twenty-newsgroups.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/twenty-newsgroups.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/twenty-newsgroups.md new file mode 100644 index 0000000..472aaf6 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/twenty-newsgroups.md @@ -0,0 +1,179 @@ +--- +layout: default +title: Twenty Newsgroups +theme: + name: retro-mahout +--- + + +<a name="TwentyNewsgroups-TwentyNewsgroupsClassificationExample"></a> +## Twenty Newsgroups Classification Example + +<a name="TwentyNewsgroups-Introduction"></a> +## Introduction + +The 20 newsgroups dataset is a collection of approximately 20,000 +newsgroup documents, partitioned (nearly) evenly across 20 different +newsgroups. The 20 newsgroups collection has become a popular data set for +experiments in text applications of machine learning techniques, such as +text classification and text clustering. We will use the [Mahout CBayes](http://mahout.apache.org/users/mapreduce/classification/bayesian.html) +classifier to create a model that would classify a new document into one of +the 20 newsgroups. + +<a name="TwentyNewsgroups-Prerequisites"></a> +### Prerequisites + +* Mahout has been downloaded ([instructions here](https://mahout.apache.org/general/downloads.html)) +* Maven is available +* Your environment has the following variables: + * **HADOOP_HOME** Environment variables refers to where Hadoop lives + * **MAHOUT_HOME** Environment variables refers to where Mahout lives + +<a name="TwentyNewsgroups-Instructionsforrunningtheexample"></a> +### Instructions for running the example + +1. If running Hadoop in cluster mode, start the hadoop daemons by executing the following commands: + + $ cd $HADOOP_HOME/bin + $ ./start-all.sh + + Otherwise: + + $ export MAHOUT_LOCAL=true + +2. In the trunk directory of Mahout, compile and install Mahout: + + $ cd $MAHOUT_HOME + $ mvn -DskipTests clean install + +3. Run the [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) by executing: + + $ ./examples/bin/classify-20newsgroups.sh + +4. You will be prompted to select a classification method algorithm: + + 1. Complement Naive Bayes + 2. Naive Bayes + 3. Stochastic Gradient Descent + +Select 1 and the the script will perform the following: + +1. Create a working directory for the dataset and all input/output. +2. Download and extract the *20news-bydate.tar.gz* from the [20 newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory. +3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. +4. Convert and preprocesses the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document. +5. Split the preprocessed dataset into training and testing sets. +6. Train the classifier. +7. Test the classifier. + + +Output should look something like: + + + ======================================================= + Confusion Matrix + ------------------------------------------------------- + a b c d e f g h i j k l m n o p q r s t <--Classified as + 381 0 0 0 0 9 1 0 0 0 1 0 0 2 0 1 0 0 3 0 |398 a=rec.motorcycles + 1 284 0 0 0 0 1 0 6 3 11 0 66 3 0 6 0 4 9 0 |395 b=comp.windows.x + 2 0 339 2 0 3 5 1 0 0 0 0 1 1 12 1 7 0 2 0 |376 c=talk.politics.mideast + 4 0 1 327 0 2 2 0 0 2 1 1 0 5 1 4 12 0 2 0 |364 d=talk.politics.guns + 7 0 4 32 27 7 7 2 0 12 0 0 6 0 100 9 7 31 0 0 |251 e=talk.religion.misc + 10 0 0 0 0 359 2 2 0 0 3 0 1 6 0 1 0 0 11 0 |396 f=rec.autos + 0 0 0 0 0 1 383 9 1 0 0 0 0 0 0 0 0 3 0 0 |397 g=rec.sport.baseball + 1 0 0 0 0 0 9 382 0 0 0 0 1 1 1 0 2 0 2 0 |399 h=rec.sport.hockey + 2 0 0 0 0 4 3 0 330 4 4 0 5 12 0 0 2 0 12 7 |385 i=comp.sys.mac.hardware + 0 3 0 0 0 0 1 0 0 368 0 0 10 4 1 3 2 0 2 0 |394 j=sci.space + 0 0 0 0 0 3 1 0 27 2 291 0 11 25 0 0 1 0 13 18|392 k=comp.sys.ibm.pc.hardware + 8 0 1 109 0 6 11 4 1 18 0 98 1 3 11 10 27 1 1 0 |310 l=talk.politics.misc + 0 11 0 0 0 3 6 0 10 6 11 0 299 13 0 2 13 0 7 8 |389 m=comp.graphics + 6 0 1 0 0 4 2 0 5 2 12 0 8 321 0 4 14 0 8 6 |393 n=sci.electronics + 2 0 0 0 0 0 4 1 0 3 1 0 3 1 372 6 0 2 1 2 |398 o=soc.religion.christian + 4 0 0 1 0 2 3 3 0 4 2 0 7 12 6 342 1 0 9 0 |396 p=sci.med + 0 1 0 1 0 1 4 0 3 0 1 0 8 4 0 2 369 0 1 1 |396 q=sci.crypt + 10 0 4 10 1 5 6 2 2 6 2 0 2 1 86 15 14 152 0 1 |319 r=alt.atheism + 4 0 0 0 0 9 1 1 8 1 12 0 3 0 2 0 0 0 341 2 |390 s=misc.forsale + 8 5 0 0 0 1 6 0 8 5 50 0 40 2 1 0 9 0 3 256|394 t=comp.os.ms-windows.misc + ======================================================= + Statistics + ------------------------------------------------------- + Kappa 0.8808 + Accuracy 90.8596% + Reliability 86.3632% + Reliability (standard deviation) 0.2131 + + + + + +<a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a> +## End to end commands to build a CBayes model for 20 newsgroups +The [20 newsgroups example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) issues the following commands as outlined above. We can build a CBayes classifier from the command line by following the process in the script: + +*Be sure that **MAHOUT_HOME**/bin and **HADOOP_HOME**/bin are in your **$PATH*** + +1. Create a working directory for the dataset and all input/output. + + $ export WORK_DIR=/tmp/mahout-work-${USER} + $ mkdir -p ${WORK_DIR} + +2. Download and extract the *20news-bydate.tar.gz* from the [20newsgroups dataset](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) to the working directory. + + $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz + -o ${WORK_DIR}/20news-bydate.tar.gz + $ mkdir -p ${WORK_DIR}/20news-bydate + $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd .. + $ mkdir ${WORK_DIR}/20news-all + $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all + * If you're running on a Hadoop cluster: + + $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all + +3. Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile. + + $ mahout seqdirectory + -i ${WORK_DIR}/20news-all + -o ${WORK_DIR}/20news-seq + -ow + +4. Convert and preprocesses the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document. + + $ mahout seq2sparse + -i ${WORK_DIR}/20news-seq + -o ${WORK_DIR}/20news-vectors + -lnorm + -nv + -wt tfidf +If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here e.g.: -ng 2 for bigrams or -n 2 for L2 length normalization. See the [Creating vectors from text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) page for a list of all seq2sparse options. + +5. Split the preprocessed dataset into training and testing sets. + + $ mahout split + -i ${WORK_DIR}/20news-vectors/tfidf-vectors + --trainingOutput ${WORK_DIR}/20news-train-vectors + --testOutput ${WORK_DIR}/20news-test-vectors + --randomSelectionPct 40 + --overwrite --sequenceFiles -xm sequential + +6. Train the classifier. + + $ mahout trainnb + -i ${WORK_DIR}/20news-train-vectors + -el + -o ${WORK_DIR}/model + -li ${WORK_DIR}/labelindex + -ow + -c + +7. Test the classifier. + + $ mahout testnb + -i ${WORK_DIR}/20news-test-vectors + -m ${WORK_DIR}/model + -l ${WORK_DIR}/labelindex + -ow + -o ${WORK_DIR}/20news-testing + -c + + + \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/20newsgroups.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/20newsgroups.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/20newsgroups.md new file mode 100644 index 0000000..da5174f --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/20newsgroups.md @@ -0,0 +1,11 @@ +--- +layout: default +title: 20Newsgroups +theme: + name: retro-mahout +--- + +<a name="20Newsgroups-NaiveBayesusing20NewsgroupsData"></a> +# Naive Bayes using 20 Newsgroups Data + +See [https://issues.apache.org/jira/browse/MAHOUT-9](https://issues.apache.org/jira/browse/MAHOUT-9) http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-clustering.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-clustering.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-clustering.md new file mode 100644 index 0000000..eb4c845 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-clustering.md @@ -0,0 +1,188 @@ +--- +layout: default +title: Canopy Clustering +theme: + name: retro-mahout +--- + +<a name="CanopyClustering-CanopyClustering"></a> +# Canopy Clustering + +[Canopy Clustering](http://www.kamalnigam.com/papers/canopy-kdd00.pdf) + is a very simple, fast and surprisingly accurate method for grouping +objects into clusters. All objects are represented as a point in a +multidimensional feature space. The algorithm uses a fast approximate +distance metric and two distance thresholds T1 > T2 for processing. The +basic algorithm is to begin with a set of points and remove one at random. +Create a Canopy containing this point and iterate through the remainder of +the point set. At each point, if its distance from the first point is < T1, +then add the point to the cluster. If, in addition, the distance is < T2, +then remove the point from the set. This way points that are very close to +the original will avoid all further processing. The algorithm loops until +the initial set is empty, accumulating a set of Canopies, each containing +one or more points. A given point may occur in more than one Canopy. + +Canopy Clustering is often used as an initial step in more rigorous +clustering techniques, such as [K-Means Clustering](k-means-clustering.html) +. By starting with an initial clustering the number of more expensive +distance measurements can be significantly reduced by ignoring points +outside of the initial canopies. + +**WARNING**: Canopy is deprecated in the latest release and will be removed once streaming k-means becomes stable enough. + +<a name="CanopyClustering-Strategyforparallelization"></a> +## Strategy for parallelization + +Looking at the sample Hadoop implementation in [http://code.google.com/p/canopy-clustering/](http://code.google.com/p/canopy-clustering/) + the processing is done in 3 M/R steps: +1. The data is massaged into suitable input format +1. Each mapper performs canopy clustering on the points in its input set and +outputs its canopies' centers +1. The reducer clusters the canopy centers to produce the final canopy +centers +1. The points are then clustered into these final canopies + +Some ideas can be found in [Cluster computing and MapReduce](https://www.youtube.com/watch?v=yjPBkvYh-ss&list=PLEFAB97242917704A) + lecture video series \[by Google(r)\]; Canopy Clustering is discussed in [lecture #4](https://www.youtube.com/watch?v=1ZDybXl212Q) +. Finally here is the [Wikipedia page](http://en.wikipedia.org/wiki/Canopy_clustering_algorithm) +. + +<a name="CanopyClustering-Designofimplementation"></a> +## Design of implementation + +The implementation accepts as input Hadoop SequenceFiles containing +multidimensional points (VectorWritable). Points may be expressed either as +dense or sparse Vectors and processing is done in two phases: Canopy +generation and, optionally, Clustering. + +<a name="CanopyClustering-Canopygenerationphase"></a> +### Canopy generation phase + +During the map step, each mapper processes a subset of the total points and +applies the chosen distance measure and thresholds to generate canopies. In +the mapper, each point which is found to be within an existing canopy will +be added to an internal list of Canopies. After observing all its input +vectors, the mapper updates all of its Canopies and normalizes their totals +to produce canopy centroids which are output, using a constant key +("centroid") to a single reducer. The reducer receives all of the initial +centroids and again applies the canopy measure and thresholds to produce a +final set of canopy centroids which is output (i.e. clustering the cluster +centroids). The reducer output format is: SequenceFile(Text, Canopy) with +the _key_ encoding the canopy identifier. + +<a name="CanopyClustering-Clusteringphase"></a> +### Clustering phase + +During the clustering phase, each mapper reads the Canopies produced by the +first phase. Since all mappers have the same canopy definitions, their +outputs will be combined during the shuffle so that each reducer (many are +allowed here) will see all of the points assigned to one or more canopies. +The output format will then be: SequenceFile(IntWritable, +WeightedVectorWritable) with the _key_ encoding the canopyId. The +WeightedVectorWritable has two fields: a double weight and a VectorWritable +vector. Together they encode the probability that each vector is a member +of the given canopy. + +<a name="CanopyClustering-RunningCanopyClustering"></a> +## Running Canopy Clustering + +The canopy clustering algorithm may be run using a command-line invocation +on CanopyDriver.main or by making a Java call to CanopyDriver.run(...). +Both require several arguments: + +Invocation using the command line takes the form: + + + bin/mahout canopy \ + -i <input vectors directory> \ + -o <output working directory> \ + -dm <DistanceMeasure> \ + -t1 <T1 threshold> \ + -t2 <T2 threshold> \ + -t3 <optional reducer T1 threshold> \ + -t4 <optional reducer T2 threshold> \ + -cf <optional cluster filter size (default: 0)> \ + -ow <overwrite output directory if present> + -cl <run input vector clustering after computing Canopies> + -xm <execution method: sequential or mapreduce> + + +Invocation using Java involves supplying the following arguments: + +1. input: a file path string to a directory containing the input data set a +SequenceFile(WritableComparable, VectorWritable). The sequence file _key_ +is not used. +1. output: a file path string to an empty directory which is used for all +output from the algorithm. +1. measure: the fully-qualified class name of an instance of DistanceMeasure +which will be used for the clustering. +1. t1: the T1 distance threshold used for clustering. +1. t2: the T2 distance threshold used for clustering. +1. t3: the optional T1 distance threshold used by the reducer for +clustering. If not specified, T1 is used by the reducer. +1. t4: the optional T2 distance threshold used by the reducer for +clustering. If not specified, T2 is used by the reducer. +1. clusterFilter: the minimum size for canopies to be output by the +algorithm. Affects both sequential and mapreduce execution modes, and +mapper and reducer outputs. +1. runClustering: a boolean indicating, if true, that the clustering step is +to be executed after clusters have been determined. +1. runSequential: a boolean indicating, if true, that the computation is to +be run in memory using the reference Canopy implementation. Note: that the +sequential implementation performs a single pass through the input vectors +whereas the MapReduce implementation performs two passes (once in the +mapper and again in the reducer). The MapReduce implementation will +typically produce less clusters than the sequential implementation as a +result. + +After running the algorithm, the output directory will contain: +1. clusters-0: a directory containing SequenceFiles(Text, Canopy) produced +by the algorithm. The Text _key_ contains the cluster identifier of the +Canopy. +1. clusteredPoints: (if runClustering enabled) a directory containing +SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable _key_ is +the canopyId. The WeightedVectorWritable _value_ is a bean containing a +double _weight_ and a VectorWritable _vector_ where the weight indicates +the probability that the vector is a member of the canopy. For canopy +clustering, the weights are computed as 1/(1+distance) where the distance +is between the cluster center and the vector using the chosen +DistanceMeasure. + +<a name="CanopyClustering-Examples"></a> +# Examples + +The following images illustrate Canopy clustering applied to a set of +randomly-generated 2-d data points. The points are generated using a normal +distribution centered at a mean location and with a constant standard +deviation. See the README file in the [/examples/src/main/java/org/apache/mahout/clustering/display/README.txt](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/clustering/display/README.txt) + for details on running similar examples. + +The points are generated as follows: + +* 500 samples m=\[1.0, 1.0\](1.0,-1.0\.html) + sd=3.0 +* 300 samples m=\[1.0, 0.0\](1.0,-0.0\.html) + sd=0.5 +* 300 samples m=\[0.0, 2.0\](0.0,-2.0\.html) + sd=0.1 + +In the first image, the points are plotted and the 3-sigma boundaries of +their generator are superimposed. + + + +In the second image, the resulting canopies are shown superimposed upon the +sample data. Each canopy is represented by two circles, with radius T1 and +radius T2. + + + +The third image uses the same values of T1 and T2 but only superimposes +canopies covering more than 10% of the population. This is a bit better +representation of the data but it still has lots of room for improvement. +The advantage of Canopy clustering is that it is single-pass and fast +enough to iterate runs using different T1, T2 parameters and display +thresholds. + + + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-commandline.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-commandline.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-commandline.md new file mode 100644 index 0000000..446faba --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/canopy-commandline.md @@ -0,0 +1,70 @@ +--- +layout: default +title: canopy-commandline +theme: + name: retro-mahout +--- + +<a name="canopy-commandline-RunningCanopyClusteringfromtheCommandLine"></a> +# Running Canopy Clustering from the Command Line +Mahout's Canopy clustering can be launched from the same command line +invocation whether you are running on a single machine in stand-alone mode +or on a larger Hadoop cluster. The difference is determined by the +$HADOOP_HOME and $HADOOP_CONF_DIR environment variables. If both are set to +an operating Hadoop cluster on the target machine then the invocation will +run Canopy on that cluster. If either of the environment variables are +missing then the stand-alone Hadoop configuration will be invoked instead. + + + ./bin/mahout canopy <OPTIONS> + + +* In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job +will be generated in $MAHOUT_HOME/core/target/ and it's name will contain +the Mahout version number. For example, when using Mahout 0.3 release, the +job will be mahout-core-0.3.job + + +<a name="canopy-commandline-Testingitononesinglemachinew/ocluster"></a> +## Testing it on one single machine w/o cluster + +* Put the data: cp <PATH TO DATA> testdata +* Run the Job: + + ./bin/mahout canopy -i testdata -o output -dm +org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2 + + +<a name="canopy-commandline-Runningitonthecluster"></a> +## Running it on the cluster + +* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh +* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata +* Run the Job: + + export HADOOP_HOME=<Hadoop Home Directory> + export HADOOP_CONF_DIR=$HADOOP_HOME/conf + ./bin/mahout canopy -i testdata -o output -dm +org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 5 -t2 2 + +* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output +to view all outputs. + +<a name="canopy-commandline-Commandlineoptions"></a> +# Command line options + + --input (-i) input Path to job input directory.Must + be a SequenceFile of + VectorWritable + --output (-o) output The directory pathname for output. + --overwrite (-ow) If present, overwrite the output + directory before running job + --distanceMeasure (-dm) distanceMeasure The classname of the + DistanceMeasure. Default is + SquaredEuclidean + --t1 (-t1) t1 T1 threshold value + --t2 (-t2) t2 T2 threshold value + --clustering (-cl) If present, run clustering after + the iterations have taken place + --help (-h) Print out help + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/cluster-dumper.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/cluster-dumper.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/cluster-dumper.md new file mode 100644 index 0000000..454734b --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/cluster-dumper.md @@ -0,0 +1,106 @@ +--- +layout: default +title: Cluster Dumper +theme: + name: retro-mahout +--- + +<a name="ClusterDumper-Introduction"></a> +## Cluster Dumper - Introduction + +Clustering tasks in Mahout will output data in the format of a SequenceFile +(Text, Cluster) and the Text is a cluster identifier string. To analyze +this output we need to convert the sequence files to a human readable +format and this is achieved using the clusterdump utility. + +<a name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a> +## Steps for analyzing cluster output using clusterdump utility + +After you've executed a clustering tasks (either examples or real-world), +you can run clusterdumper in 2 modes: + + +1. Hadoop Environment +1. Standalone Java Program + + +<a name="ClusterDumper-HadoopEnvironment{anchor:HadoopEnvironment}"></a> +### Hadoop Environment + +If you have setup your HADOOP_HOME environment variable, you can use the +command line utility `mahout` to execute the ClusterDumper on Hadoop. In +this case we wont need to get the output clusters to our local machines. +The utility will read the output clusters present in HDFS and output the +human-readable cluster values into our local file system. Say you've just +executed the [synthetic control example ](clustering-of-synthetic-control-data.html) + and want to analyze the output, you can execute the `mahout clusterdumper` utility from the command line. + +#### CLI options: + --help Print out help + --input (-i) input The directory containing Sequence + Files for the Clusters + --output (-o) output The output file. If not specified, + dumps to the console. + --outputFormat (-of) outputFormat The optional output format to write + the results as. Options: TEXT, CSV, or GRAPH_ML + --substring (-b) substring The number of chars of the + asFormatString() to print + --pointsDir (-p) pointsDir The directory containing points + sequence files mapping input vectors + to their cluster. If specified, + then the program will output the + points associated with a cluster + --dictionary (-d) dictionary The dictionary file. + --dictionaryType (-dt) dictionaryType The dictionary file type + (text|sequencefile) + --distanceMeasure (-dm) distanceMeasure The classname of the DistanceMeasure. + Default is SquaredEuclidean. + --numWords (-n) numWords The number of top terms to print + --tempDir tempDir Intermediate output directory + --startPhase startPhase First phase to run + --endPhase endPhase Last phase to run + --evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the + input. The output will be appended to the rest of + the output at the end. + +### Standalone Java Program + +Run the clusterdump utility as follows as a standalone Java Program through Eclipse. <!-- - if you are using eclipse, setup mahout-utils as a project as specified in [Working with Maven in Eclipse](../../developers/buildingmahout.html). --> + To execute ClusterDumper.java, + +* Under mahout-utils, Right-Click on ClusterDumper.java +* Choose Run-As, Run Configurations +* On the left menu, click on Java Application +* On the top-bar click on "New Launch Configuration" +* A new launch should be automatically created with project as + + "mahout-utils" and Main Class as "org.apache.mahout.utils.clustering.ClusterDumper" + +In the arguments tab, specify the below arguments + + + --seqFileDir <MAHOUT_HOME>/examples/output/clusters-10 + --pointsDir <MAHOUT_HOME>/examples/output/clusteredPoints + --output <MAHOUT_HOME>/examples/output/clusteranalyze.txt + replace <MAHOUT_HOME> with the actual path of your $MAHOUT_HOME + +* Hit run to execute the ClusterDumper using Eclipse. Setting breakpoints etc should just work fine. + +Reading the output file + +This will output the clusters into a file called clusteranalyze.txt inside $MAHOUT_HOME/examples/output +Sample data will look like + +CL-0 { n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, 29.937, 29.751, 30.054, 30.039, 30.126, 29.764, 29.835, 30.503, 29.876, 29.990, 29.605, 29.379, 30.120, 29.882, 30.161, 29.825, 30.074, 30.001, 30.421, 29.867, 29.736, 29.760, 30.192, 30.134, 30.082, 29.962, 29.512, 29.736, 29.594, 29.493, 29.761, 29.183, 29.517, 29.273, 29.161, 29.215, 29.731, 29.154, 29.113, 29.348, 28.981, 29.543, 29.192, 29.479, 29.406, 29.715, 29.344, 29.628, 29.074, 29.347, 29.812, 29.058, 29.177, 29.063, 29.607](29.922,-30.407,-30.373,-30.094,-29.886,-29.937,-29.751,-30.054,-30.039,-30.126,-29.764,-29.835,-30.503,-29.876,-29.990,-29.605,-29.379,-30.120,-29.882,-30.161,-29.825,-30.074,-30.001,-30.421,-29.867,-29.736,-29.760,-30.192,-30.134,-30.082,-29.962,-29.512,-29.736,-29.594,-29.493,-29.761,-29.183,-29.517,-29.273,-29.161,-29.215,-29.731,-29.154,-29.113,-29.348,-28.981,-29.543,-29.192,-29.479,-29.406,-29.715,-29.344,-29.628,-29.074,-29.347,-29.812,-29.058,-29.177,-29.063,-29.607.html) + r=[3.463, 3.351, 3.452, 3.438, 3.371, 3.569, 3.253, 3.531, 3.439, 3.472, +3.402, 3.459, 3.320, 3.260, 3.430, 3.452, 3.320, 3.499, 3.302, 3.511, +3.520, 3.447, 3.516, 3.485, 3.345, 3.178, 3.492, 3.434, 3.619, 3.483, +3.651, 3.833, 3.812, 3.433, 4.133, 3.855, 4.123, 3.999, 4.467, 4.731, +4.539, 4.956, 4.644, 4.382, 4.277, 4.918, 4.784, 4.582, 4.915, 4.607, +4.672, 4.577, 5.035, 5.241, 4.731, 4.688, 4.685, 4.657, 4.912, 4.300] } + +and on... + +where CL-0 is the Cluster 0 and n=116 refers to the number of points observed by this cluster and c = \[29.922 ...\] + refers to the center of Cluster as a vector and r = \[3.463 ..\] refers to +the radius of the cluster as a vector. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clustering-of-synthetic-control-data.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clustering-of-synthetic-control-data.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clustering-of-synthetic-control-data.md new file mode 100644 index 0000000..693568f --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clustering-of-synthetic-control-data.md @@ -0,0 +1,53 @@ +--- +layout: default +title: Clustering of synthetic control data +theme: + name: retro-mahout +--- + +# Clustering synthetic control data + +## Introduction + +This example will demonstrate clustering of time series data, specifically control charts. [Control charts](http://en.wikipedia.org/wiki/Control_chart) are tools used to determine whether a manufacturing or business process is in a state of statistical control. Such control charts are generated / simulated repeatedly at equal time intervals. A [simulated dataset](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html) is available for use in UCI machine learning repository. + +A time series of control charts needs to be clustered into their close knit groups. The data set we use is synthetic and is meant to resemble real world information in an anonymized format. It contains six different classes: Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift. In this example we will use Mahout to cluster the data into corresponding class buckets. + +*For the sake of simplicity, we won't use a cluster in this example, but instead show you the commands to run the clustering examples locally with Hadoop*. + +## Setup + +We need to do some initial setup before we are able to run the example. + + + 1. Start out by downloading the dataset to be clustered from the UCI Machine Learning Repository: [http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data](http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data). + + 2. Download the [latest release of Mahout](/general/downloads.html). + + 3. Unpack the release binary and switch to the *mahout-distribution-0.x* folder + + 4. Make sure that the *JAVA_HOME* environment variable points to your local java installation + + 5. Create a folder called *testdata* in the current directory and copy the dataset into this folder. + + +## Clustering Examples + +Depending on the clustering algorithm you want to run, the following commands can be used: + + + * [Canopy Clustering](/users/clustering/canopy-clustering.html) + + bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job + + * [k-Means Clustering](/users/clustering/k-means-clustering.html) + + bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job + + + * [Fuzzy k-Means Clustering](/users/clustering/fuzzy-k-means.html) + + bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job + +The clustering output will be produced in the *output* directory. The output data points are in vector format. In order to read/analyze the output, you can use the [clusterdump](/users/clustering/cluster-dumper.html) utility provided by Mahout. + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clustering-seinfeld-episodes.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clustering-seinfeld-episodes.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clustering-seinfeld-episodes.md new file mode 100644 index 0000000..8a983fc --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clustering-seinfeld-episodes.md @@ -0,0 +1,11 @@ +--- +layout: default +title: Clustering Seinfeld Episodes +theme: + name: retro-mahout +--- + +Below is short tutorial on how to cluster Seinfeld episode transcripts with +Mahout. + +http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/ http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clusteringyourdata.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clusteringyourdata.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clusteringyourdata.md new file mode 100644 index 0000000..bdd6d01 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/clusteringyourdata.md @@ -0,0 +1,126 @@ +--- +layout: default +title: ClusteringYourData +theme: + name: retro-mahout +--- + +# Clustering your data + +After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own +data. See also [Wikipedia on cluster analysis](en.wikipedia.org/wiki/Cluster_analysis) for more background. + +The following pieces *may* be useful for in getting started: + +<a name="ClusteringYourData-Input"></a> +# Input + +For starters, you will need your data in an appropriate Vector format, see [Creating Vectors](../basics/creating-vectors.html). +In particular for text preparation check out [Creating Vectors from Text](../basics/creating-vectors-from-text.html). + + +<a name="ClusteringYourData-RunningtheProcess"></a> +# Running the Process + +* [Canopy background](canopy-clustering.html) and [canopy-commandline](canopy-commandline.html). + +* [K-Means background](k-means-clustering.html), [k-means-commandline](k-means-commandline.html), and +[fuzzy-k-means-commandline](fuzzy-k-means-commandline.html). + +* [Dirichlet background](dirichlet-process-clustering.html) and [dirichlet-commandline](dirichlet-commandline.html). + +* [Meanshift background](mean-shift-clustering.html) and [mean-shift-commandline](mean-shift-commandline.html). + +* [LDA (Latent Dirichlet Allocation) background](-latent-dirichlet-allocation.html) and [lda-commandline](lda-commandline.html). + +* TODO: kmeans++/ streaming kMeans documentation + + +<a name="ClusteringYourData-RetrievingtheOutput"></a> +# Retrieving the Output + +Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data. + + ./bin/mahout clusterdump <OPTIONS> + + +<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a> +## The cluster dumper options are: + + --help (-h) Print out help + + --input (-i) input The directory containing Sequence + Files for the Clusters + + --output (-o) output The output file. If not specified, + dumps to the console. + + --outputFormat (-of) outputFormat The optional output format to write + the results as. Options: TEXT, CSV, or GRAPH_ML + + --substring (-b) substring The number of chars of the + asFormatString() to print + + --pointsDir (-p) pointsDir The directory containing points + sequence files mapping input vectors to their cluster. If specified, + then the program will output the + points associated with a cluster + + --dictionary (-d) dictionary The dictionary file. + + --dictionaryType (-dt) dictionaryType The dictionary file type + (text|sequencefile) + + --distanceMeasure (-dm) distanceMeasure The classname of the DistanceMeasure. + Default is SquaredEuclidean. + + --numWords (-n) numWords The number of top terms to print + + --tempDir tempDir Intermediate output directory + + --startPhase startPhase First phase to run + + --endPhase endPhase Last phase to run + + --evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the + input. The output will be appended to the rest of + the output at the end. + + +More information on using clusterdump utility can be found [here](cluster-dumper.html) + +<a name="ClusteringYourData-ValidatingtheOutput"></a> +# Validating the Output + +{quote} +Ted Dunning: A principled approach to cluster evaluation is to measure how well the +cluster membership captures the structure of unseen data. A natural +measure for this is to measure how much of the entropy of the data is +captured by cluster membership. For k-means and its natural L_2 metric, +the natural cluster quality metric is the squared distance from the nearest +centroid adjusted by the log_2 of the number of clusters. This can be +compared to the squared magnitude of the original data or the squared +deviation from the centroid for all of the data. The idea is that you are +changing the representation of the data by allocating some of the bits in +your original representation to represent which cluster each point is in. +If those bits aren't made up by the residue being small then your +clustering is making a bad trade-off. + +In the past, I have used other more heuristic measures as well. One of the +key characteristics that I would like to see out of a clustering is a +degree of stability. Thus, I look at the fractions of points that are +assigned to each cluster or the distribution of distances from the cluster +centroid. These values should be relatively stable when applied to held-out +data. + +For text, you can actually compute perplexity which measures how well +cluster membership predicts what words are used. This is nice because you +don't have to worry about the entropy of real valued numbers. + +Manual inspection and the so-called laugh test is also important. The idea +is that the results should not be so ludicrous as to make you laugh. +Unfortunately, it is pretty easy to kid yourself into thinking your system +is working using this kind of inspection. The problem is that we are too +good at seeing (making up) patterns. +{quote} + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/clustering/expectation-maximization.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/clustering/expectation-maximization.md b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/expectation-maximization.md new file mode 100644 index 0000000..6ccc8c3 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/clustering/expectation-maximization.md @@ -0,0 +1,62 @@ +--- +layout: default +title: Expectation Maximization +theme: + name: retro-mahout +--- +<a name="ExpectationMaximization-ExpectationMaximization"></a> +# Expectation Maximization + +The principle of EM can be applied to several learning settings, but is +most commonly associated with clustering. The main principle of the +algorithm is comparable to k-Means. Yet in contrast to hard cluster +assignments, each object is given some probability to belong to a cluster. +Accordingly cluster centers are recomputed based on the average of all +objects weighted by their probability of belonging to the cluster at hand. + +<a name="ExpectationMaximization-Canopy-modifiedEM"></a> +## Canopy-modified EM + +One can also use the canopies idea to speed up prototypebased clustering +methods like K-means and Expectation-Maximization (EM). In general, neither +K-means nor EMspecify how many clusters to use. The canopies technique does +not help this choice. + +Prototypes (our estimates of the cluster centroids) are associated with the +canopies that contain them, and the prototypes are only influenced by data +that are inside their associated canopies. After creating the canopies, we +decide how many prototypes will be created for each canopy. This could be +done, for example, using the number of data points in a canopy and AIC or +BIC where points that occur in more than one canopy are counted +fractionally. Then we place prototypesinto each canopy. This initial +placement can be random, as long as it is within the canopy in question, as +determined by the inexpensive distance metric. + +Then, instead of calculating the distance from each prototype to every +point (as is traditional, a O(nk) operation), theE-step instead calculates +the distance from each prototype to a much smaller number of points. For +each prototype, we find the canopies that contain it (using the cheap +distance metric), and only calculate distances (using the expensive +distance metric) from that prototype to points within those canopies. + +Note that by this procedure prototypes may move across canopy boundaries +when canopies overlap. Prototypes may move to cover the data in the +overlapping region, and then move entirely into another canopy in order to +cover data there. + +The canopy-modified EM algorithm behaves very similarly to traditional EM, +with the slight difference that points outside the canopy have no influence +on points in the canopy, rather than a minute influence. If the canopy +property holds, and points in the same cluster fall in the same canopy, +then the canopy-modified EM will almost always converge to the same maximum +in likelihood as the traditional EM. In fact, the difference in each +iterative step (apart from the enormous computational savings of computing +fewer terms) will be negligible since points outside the canopy will have +exponentially small influence. + +<a name="ExpectationMaximization-StrategyforParallelization"></a> +## Strategy for Parallelization + +<a name="ExpectationMaximization-Map/ReduceImplementation"></a> +## Map/Reduce Implementation +
