Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT) Page: Breiman Example (https://cwiki.apache.org/confluence/display/MAHOUT/Breiman+Example)
Change Comment:
---------------------------------------------------------------------
Updated to reflect that package names have changed since the code snippets were written.

Edited by Brian Stempin:
---------------------------------------------------------------------
h1. Introduction

This quick start page shows how to run the Breiman example, which implements the test procedure described in Breiman's paper [1]. Here m is the number of attributes randomly selected at each tree node and M is the total number of input attributes. The basic algorithm is as follows:

* repeat I iterations
* for each iteration:
** set aside 10% of the dataset as a testing set
** build two forests using the training set, one with m=int(log2(M)+1) (called Random-Input) and one with m=1 (called Single-Input)
** choose the forest that gave the lowest out-of-bag (oob) error estimate and use it to compute the test set error
** compute the test set error using the Single-Input forest (test error); this demonstrates that even with m=1, Decision Forests give results comparable to those obtained with greater values of m
** compute the mean test set error using every tree of the chosen forest (tree error); this indicates how well a single Decision Tree performs
* compute the mean test error over all iterations
* compute the mean tree error over all iterations

h1. Steps

h2. Download the data
* The current implementation is compatible with the UCI repository file format.
* Here are links to some of the datasets used in Breiman's paper:
** glass : http://archive.ics.uci.edu/ml/datasets/Glass+Identification
** breast cancer : http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
** diabetes : http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
** sonar : http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)
** ionosphere : http://archive.ics.uci.edu/ml/datasets/Ionosphere
** vehicle : [http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)|http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)]
** german : [http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)|http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)]
* Put the data in HDFS:
{code}$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata{code}

h2. Build the Job files
* In $MAHOUT_HOME/ run:
{code}mvn install -DskipTests{code}

h2. Generate a file descriptor for the dataset
For the glass dataset (glass.data), run:
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/glass.data -f testdata/glass.info -d I 9 N L
{code}
The "I 9 N L" string describes the attributes in column order: 1 ignored (I) attribute, followed by 9 numerical (N) attributes, followed by the label (L).
* You can also use C for categorical (nominal) attributes.

h2. Run the example
{code}
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<VERSION>-job.jar org.apache.mahout.classifier.df.BreimanExample -d testdata/glass.data -ds testdata/glass.info -i 10 -t 100
{code}
This builds 100 trees (-t argument) and repeats the test for 10 iterations (-i argument).
* The example outputs the following results:
** Selection error : mean test error for the selected forest over all iterations
** Single Input error : mean test error for the Single-Input forest over all iterations
** One Tree error : mean single tree error over all iterations
** Mean Random Input Time : mean build time for Random-Input forests over all iterations
** Mean Single Input Time : mean build time for Single-Input forests over all iterations
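
To make two details from this page concrete, here is a minimal shell sketch. It is not part of Mahout; both function names (expand_descriptor, random_input_m) are hypothetical helpers written for illustration. The first expands a descriptor string such as "I 9 N L" into one letter per column, reading a number as a repeat count for the token that follows it. The second evaluates m=int(log2(M)+1) for a given M using integer arithmetic only. The value M=9 in the usage lines is an assumption based on the 9 numerical attributes of the glass dataset.

```shell
#!/bin/sh

# Hypothetical helper (not a Mahout command): expand a descriptor such as
# "I 9 N L" into one attribute letter per column.
expand_descriptor() {
    out=""
    count=1
    for token in "$@"; do
        case "$token" in
            *[!0-9]*)
                # Attribute type letter (I, N, C or L): emit it `count` times.
                i=0
                while [ "$i" -lt "$count" ]; do
                    out="$out${out:+ }$token"
                    i=$((i + 1))
                done
                count=1
                ;;
            *)
                # Purely numeric token: repeat count for the next letter.
                count=$token
                ;;
        esac
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: m = int(log2(M) + 1) for M >= 1, computed as the
# number of times M can be halved (integer division) before reaching zero.
random_input_m() {
    n=$1
    m=0
    while [ "$n" -gt 0 ]; do
        n=$((n / 2))
        m=$((m + 1))
    done
    printf '%s\n' "$m"
}

expand_descriptor I 9 N L   # prints: I N N N N N N N N N L
random_input_m 9            # prints: 4 (assuming M=9)
```

With these assumptions, the Random-Input forest for the glass dataset would select 4 attributes per node, while the Single-Input forest always selects 1.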
