[CONF] Apache Mahout > Clustering of synthetic control data

confluence Tue, 05 Oct 2010 03:24:28 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Clustering of synthetic control data 
(https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data)



Edited by Mat Kelcey:
---------------------------------------------------------------------
{toc}

h1. Introduction

This example demonstrates clustering of time series data in the form of control 
charts. [Control charts|http://en.wikipedia.org/wiki/Control_chart] are tools 
used to determine whether a manufacturing or business process is in a state of 
statistical control. The control charts used here are synthetically generated 
at equal time intervals and are available from the UCI Machine Learning 
Repository. The data is described 
[here|http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html].

h1. Problem description

A set of control chart time series needs to be clustered into close-knit 
groups. The data set is synthetic and is meant to resemble real-world data in 
an anonymized form. It contains six classes (Normal, Cyclic, Increasing trend, 
Decreasing trend, Upward shift, Downward shift). Given these trends in the 
input data, the Mahout clustering algorithm should group the series into 
buckets corresponding to their classes. By the end of this example, you will 
have seen how to run clustering with Mahout.

h1. Pre-Prep

Make sure you have the following in place before you work through the example.
# Input data set. Download it 
[here|http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data].
## Sample input data:
The input consists of 600 rows and 60 columns. Rows 1 - 100 contain Normal 
data, rows 101 - 200 contain Cyclic data, and so on. More information is 
available 
[here|http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html]. 
A sample of the data is shown below.
|| \_time || \_time+x || \_time+2x || \_time+3x || .. || \_time+59x ||
| 28.7812 | 34.4632 | 31.3381 | .. | .. | 31.2834 |
| 24.8923 | 25.741 | 27.5532 | .. | .. | 32.8217 |
| .. | .. | .. | .. | .. | .. |
| 35.5351 | 41.7067 | 39.1705 | 48.3964 | .. | 38.6103 |
| 24.2104 | 41.7679 | 45.2228 | 43.7762 | .. | 48.8175 |
| .. | .. | .. | .. | .. | .. |
# Set up Hadoop
## Assuming that you have installed the latest compatible Hadoop, start the 
daemons using {code}$HADOOP_HOME/bin/start-all.sh{code} If you have issues 
starting Hadoop, please refer to the [Hadoop quick start 
guide|http://hadoop.apache.org/common/docs/current/].
## Copy the input to HDFS using 
{code}$HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata{code} 
(The HDFS input directory must be named testdata.)
# Mahout example job
Mahout's mahout-examples-$MAHOUT_VERSION.job performs the actual clustering, 
so it needs to be built first. This can be done as follows:
## cd $MAHOUT_HOME
## {code}mvn install{code} You will see BUILD SUCCESSFUL once all the 
corresponding tasks have completed. The job is generated in 
$MAHOUT_HOME/examples/target/ and its name contains the $MAHOUT_VERSION 
number. For example, when using the Mahout 0.3 release, the job will be 
mahout-examples-0.3-SNAPSHOT.job
This completes the prerequisites for clustering with Mahout.
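
As a sanity check on the first prerequisite, the downloaded file can be inspected locally before it is copied to HDFS. The following is a minimal sketch, not part of Mahout: the function names data_shape and class_for_row are hypothetical, and the class mapping assumes six consecutive blocks of 100 rows in the order listed in the problem description.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting the input file; not part of Mahout.

# Print "<rows> <min columns> <max columns>" for a whitespace-separated file.
# Blank lines are ignored.
data_shape() {
    awk 'NF > 0 {
             rows++
             if (min == "" || NF < min) min = NF
             if (NF > max) max = NF
         }
         END { print rows + 0, min + 0, max + 0 }' "$1"
}

# Map a 1-based row number to its class label.
# Assumption: rows 1-100 are Normal, 101-200 Cyclic, and so on in the
# order given in the problem description.
class_for_row() {
    # Reject anything that is not an integer >= 1.
    if ! [ "$1" -ge 1 ] 2>/dev/null; then
        echo "unknown"
        return 1
    fi
    case $(( ($1 - 1) / 100 )) in
        0) echo "Normal" ;;
        1) echo "Cyclic" ;;
        2) echo "Increasing trend" ;;
        3) echo "Decreasing trend" ;;
        4) echo "Upward shift" ;;
        5) echo "Downward shift" ;;
        *) echo "unknown"; return 1 ;;
    esac
}
```

Given the 600 rows and 60 columns described above, running data_shape on the downloaded file would be expected to report 600 rows with a minimum and maximum of 60 columns. A different result suggests an incomplete or damaged download.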

h1. Perform Clustering

With the preparation done, clustering the control data is straightforward.
# Depending on which clustering technique you want to use, invoke the 
corresponding job as shown below.
## For [canopy |Canopy Clustering]:
{code}$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job{code}
## For [kmeans |K-Means Clustering]:
{code}$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job{code}
## For [fuzzykmeans |Fuzzy K-Means]:
{code}$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job{code}
## For [dirichlet |Dirichlet Process Clustering]:
{code}$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job{code}
## For [meanshift |Mean Shift Clustering]:
{code}$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job{code}
# Get the data out of HDFS{footnote}See [HDFS 
Shell|http://hadoop.apache.org/core/docs/current/hdfs_shell.html]{footnote}{footnote}The 
output directory is cleared when a new run starts, so the results must be 
retrieved before starting a new run{footnote} and have a look{footnote}Dirichlet 
also prints data to the console{footnote} by following the steps in the next 
section.
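
The job invocations in step 1 differ only in one package segment of the class name, so they can be wrapped in a small helper. This is a sketch only: job_class and run_clustering are hypothetical names, not part of Mahout, and the script assumes MAHOUT_HOME is set as in the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper around the example jobs listed above; not part of Mahout.

# Print the fully qualified example job class for a clustering technique.
job_class() {
    case "$1" in
        canopy|kmeans|fuzzykmeans|dirichlet|meanshift)
            echo "org.apache.mahout.clustering.syntheticcontrol.$1.Job" ;;
        *)
            echo "unknown technique: $1" >&2
            return 1 ;;
    esac
}

# Run the example job for one technique through the Mahout launcher script.
# Assumes MAHOUT_HOME points at the Mahout checkout.
run_clustering() {
    class=$(job_class "$1") || return 1
    "$MAHOUT_HOME/bin/mahout" "$class"
}
```

With this in place, calling run_clustering kmeans would issue the same command as the k-means entry above, and an unrecognized technique name fails before anything is submitted to Hadoop.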

h1. Read / Analyze Output
To read and analyze the output, you can use the [clusterdump|Cluster Dumper] 
utility provided by Mahout. If you just want to read the output, follow the 
steps below.
# Use {code}$HADOOP_HOME/bin/hadoop fs -lsr output{code} to view all outputs.
# Use {code}$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples{code} 
to copy them all to your local machine. This creates an output folder inside 
the examples directory. The output data points are in vector format.
# The computed clusters are contained in _output/clusters-i_
# All clustered points are placed into _output/clusteredPoints_
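
Once the output has been copied locally, it can be handy to locate the most recent clusters directory. The sketch below uses a hypothetical helper, latest_clusters_dir, and assumes that the _i_ in _clusters-i_ is a counter, so that the highest-numbered directory holds the latest clusters.

```shell
#!/bin/sh
# Hypothetical helper for a local copy of the output folder; not part of Mahout.

# Print the highest-numbered clusters-<i> directory under the given folder.
# Returns non-zero if no such directory exists.
latest_clusters_dir() {
    best=-1
    for dir in "$1"/clusters-*; do
        if [ ! -d "$dir" ]; then
            continue
        fi
        n=${dir##*clusters-}
        case "$n" in
            # Skip directories whose suffix is not purely numeric.
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$n" -gt "$best" ]; then
            best=$n
        fi
    done
    if [ "$best" -lt 0 ]; then
        return 1
    fi
    echo "$1/clusters-$best"
}
```

For the copy made in the steps above, the helper would be called with $MAHOUT_HOME/examples/output as its argument.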

{display-footnotes}
