Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT) Page: Synthetic Control Data (https://cwiki.apache.org/confluence/display/MAHOUT/Synthetic+Control+Data)
Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h1. Introduction

The goal of this example is to demonstrate clustering of control charts, which are time series data. [Control charts|http://en.wikipedia.org/wiki/Control_chart] are tools used to determine whether or not a manufacturing or business process is in a state of statistical control. The control charts used here were synthetically generated over a time interval and are available from the UCI Machine Learning Repository. The data is described [here|http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html].

h1. Steps

* Download the data at [http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series].
* In $MAHOUT_HOME/, build the job file:
** The same job file is used for all examples, so this only needs to be done once.
** Run: mvn install
** The job file is generated in $MAHOUT_HOME/examples/target/ and its name contains the $MAHOUT_VERSION number. For example, with the Mahout 0.3 release, the job file is mahout-examples-0.3.job
* (Optional){footnote}This step should be skipped when using standalone Hadoop{footnote} Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs \-put <PATH TO DATA> testdata
* Run the Job: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job {footnote}Substitute in whichever Clustering Job you want here: KMeans, Canopy, etc. See subdirectories of $MAHOUT_HOME/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/.{footnote}
** For [canopy|Canopy Clustering]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.canopy.Job
** For [kmeans|K-Means Clustering]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
** For [fuzzykmeans|Fuzzy K-Means]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
** For [dirichlet|Dirichlet Process Clustering]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
** For [meanshift|Mean Shift Clustering]: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
* Get the data out of HDFS{footnote}See [HDFS Shell|http://hadoop.apache.org/core/docs/current/hdfs_shell.html]{footnote}{footnote}The output directory is cleared when a new run starts, so the results must be retrieved before starting a new run{footnote} and have a look{footnote}Dirichlet also prints data to the console{footnote}
** All example jobs use _testdata_ as input and write their output to the directory _output_
** Use _bin/hadoop fs \-lsr output_ to view all outputs. Copy them all to your local machine and you can run the ClusterDumper on them.
*** Sequence files containing the original points in Vector form are in _output/data_
*** Computed clusters are contained in _output/clusters-i_
*** All resulting clustered points are placed into _output/clusteredPoints_

{display-footnotes}
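
The steps above can be strung together in a small shell script. What follows is only a sketch under stated assumptions, not part of the Mahout distribution: the function names, the DRY_RUN switch, and the local file and directory arguments are hypothetical, and only the hadoop invocations and Job class names are taken from this page. It assumes $MAHOUT_HOME, $HADOOP_HOME and $MAHOUT_VERSION are set and that the job file has already been built with mvn install. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: function names, DRY_RUN and the argument names are hypothetical.
# The hadoop commands and Job class names are the ones listed in the steps above.

# Map an algorithm name to the example Job class listed on this page.
job_class() {
  case "$1" in
    canopy|kmeans|fuzzykmeans|dirichlet|meanshift)
      echo "org.apache.mahout.clustering.syntheticcontrol.$1.Job" ;;
    *)
      echo "unknown algorithm: $1" >&2
      return 1 ;;
  esac
}

# Path of the job file produced by "mvn install".
job_file() {
  echo "$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job"
}

# Print the command unless DRY_RUN=0, in which case execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Upload the data, run one clustering job, list the results and
# copy them to a local directory before the next run clears them.
cluster_synthetic_control() {
  algo="$1"        # canopy, kmeans, fuzzykmeans, dirichlet or meanshift
  data_file="$2"   # local path of the downloaded data (assumed)
  local_dir="$3"   # local directory to copy the results into (assumed)
  class="$(job_class "$algo")" || return 1
  run "$HADOOP_HOME/bin/hadoop" fs -put "$data_file" testdata
  run "$HADOOP_HOME/bin/hadoop" jar "$(job_file)" "$class"
  run "$HADOOP_HOME/bin/hadoop" fs -lsr output
  run "$HADOOP_HOME/bin/hadoop" fs -get output "$local_dir"
}
```

For example, after sourcing the script, _cluster_synthetic_control kmeans <PATH TO DATA> <LOCAL DIR>_ prints the four commands for a k-means run. Setting DRY_RUN=0 executes them instead. Copying the results out as the last step reflects the note above that the output directory is cleared when a new run starts.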
