[jira] [Resolved] (MAHOUT-1451) Cleaning up the examples for clustering on the website

Sebastian Schelter (JIRA) Thu, 13 Mar 2014 05:05:36 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sebastian Schelter resolved MAHOUT-1451.
----------------------------------------

    Resolution: Fixed
      Assignee: Sebastian Schelter

> Cleaning up the examples for clustering on the website
> ------------------------------------------------------
>
>                 Key: MAHOUT-1451
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1451
>             Project: Mahout
>          Issue Type: Documentation
>            Reporter: Gaurav Misra
>            Assignee: Sebastian Schelter
>         Attachments: clustering-of-synthetic-control-data.txt
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Cleaning up the following clustering examples:
> =====================================
> https://mahout.apache.org/users/clustering/clustering-of-synthetic-control-data.html
> Introduction
> This example demonstrates clustering of time series data, specifically
> control charts. [Control charts: http://en.wikipedia.org/wiki/Control_chart]
> are tools used to determine whether a manufacturing or business process is in
> a state of statistical control. Such control charts are generated or simulated
> repeatedly at equal time intervals. A simulated dataset is available in the
> UCI Machine Learning Repository. The data is described [here:
> http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html].
> Problem Description
> A set of control chart time series needs to be clustered into close-knit
> groups. The data set we use is synthetic and is meant to resemble real-world
> information in an anonymized format. It contains six different classes:
> Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward
> shift. In this example we will use Mahout to cluster the data into buckets
> that correspond to these classes.
> By the end of this example:
>    * You will have clustered data using Mahout.
>    * You will see how to analyse the clusters produced by Mahout.
>    * You will have a starting point for incorporating clustering into your
> own software.
> Setup
> We need to do some initial setup before we are able to run the example. 
>   1. Start out by downloading the input dataset (to be clustered) from the 
> UCI Machine Learning Repository: 
> http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
>   2. Make sure the data consists of 600 rows and 60 columns. The first 100
> rows contain Normal data, followed by 100 rows of Cyclic data, and so on, for
> a total of 6 classes.
>   3. This example assumes that you have already set up Mahout and Hadoop. If
> you have not done so yet:
>       * Hadoop: Follow the instructions on
> http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleNodeSetup.html
>  to set up Hadoop.
>       * Mahout: Follow the instructions on the [Quickstart:
> https://mahout.apache.org/users/basics/quickstart.html] page.
>   4. If you are running Hadoop in distributed mode, make sure the Hadoop
> daemons are running.
>   5. Create a directory called « testdata » on your local machine and place
> the input dataset in it.
>   6. Run the following commands to copy the input data into HDFS:
>       * Create a directory called « testdata » on HDFS:
>                          $HADOOP_HOME/bin/hadoop fs -mkdir testdata
>       * Copy the directory named « testdata » from your local filesystem to
> HDFS:
>                          $HADOOP_HOME/bin/hadoop fs -put testdata
>   7. Build Mahout by going to the $MAHOUT_HOME directory and running one of
> the following commands:
>       * For a full build: mvn clean install
>       * For a build without unit tests: mvn -DskipTests clean install
>   8. You should see a build successful message once the build has completed.
>   9. Finally, make sure that the examples have compiled successfully. You
> should find the compiled jar in the $MAHOUT_HOME/examples/target directory
> under the name mahout-examples-{version}.job.jar
>   10. This concludes the setup required to run the examples.
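
Step 2 of the setup list asks you to confirm that the dataset has 600 rows and 60 columns. One way to do that check is sketched below in Python. This is only an illustration, not part of Mahout: the function names are hypothetical, and it assumes the file is plain text with one whitespace-separated row of numbers per line.

```python
from typing import List

# Dimensions and class order as stated in the issue description.
EXPECTED_ROWS = 600
EXPECTED_COLS = 60
ROWS_PER_CLASS = 100
CLASSES = [
    "Normal", "Cyclic", "Increasing trend",
    "Decreasing trend", "Upward shift", "Downward shift",
]


def parse_rows(text: str) -> List[List[float]]:
    """Parse whitespace-separated numeric rows, skipping blank lines."""
    return [
        [float(token) for token in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


def check_shape(rows: List[List[float]],
                expected_rows: int = EXPECTED_ROWS,
                expected_cols: int = EXPECTED_COLS) -> List[str]:
    """Return a list of problems; an empty list means the shape is as expected."""
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        if len(row) != expected_cols:
            problems.append(
                f"row {number}: expected {expected_cols} columns, found {len(row)}"
            )
    return problems


def class_of_row(index: int) -> str:
    """Map a 0-based row index to its class (100 consecutive rows per class)."""
    return CLASSES[index // ROWS_PER_CLASS]


def check_file(path: str) -> List[str]:
    """Read a dataset file and report any shape problems."""
    with open(path) as handle:
        return check_shape(parse_rows(handle.read()))
```

For example, after downloading the file into the local testdata directory, check_file("testdata/synthetic_control.data") should return an empty list if the download is intact.
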
> Clustering Examples
> There are examples available for three clustering algorithms:
>    * Canopy Clustering: 
> https://mahout.apache.org/users/clustering/canopy-clustering.html
>    * k-Means Clustering: 
> https://mahout.apache.org/users/clustering/k-means-clustering.html
>    * Fuzzy k-Means Clustering: 
> https://mahout.apache.org/users/clustering/fuzzy-k-means.html
> Depending on the example you want to run, use one of the following commands:
>    * Canopy Clustering: $MAHOUT_HOME/bin/mahout 
> org.apache.mahout.clustering.syntheticcontrol.canopy.Job
>    * k-Means Clustering: $MAHOUT_HOME/bin/mahout 
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
>    * Fuzzy k-Means Clustering: $MAHOUT_HOME/bin/mahout 
> org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
> The clustering output is written to the « output » directory on HDFS.
> Because this directory is overwritten on each run, copy the output to your
> local filesystem if you want to keep it.
> Use the following command to copy the data to your local filesystem:
> $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
> This creates an output folder inside the examples directory. The output data
> points are in vector format. To read or analyze the output, you can use the
> [clusterdump: https://mahout.apache.org/users/clustering/cluster-dumper.html]
> utility provided by Mahout.
> The source code for these examples is located under the examples project.
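
Because the input rows are ordered by class (100 rows per class), a simple way to inspect a clustering result is to tally, for each known class, how many of its rows landed in each cluster. The Python sketch below does only that tally. It assumes you have already extracted one cluster id per input row, in input order, from the dumped output; that extraction step is not shown, and the function name is hypothetical rather than anything Mahout provides.

```python
from collections import Counter
from typing import Dict, Sequence

# Class order and block size as stated in the issue description.
CLASSES = [
    "Normal", "Cyclic", "Increasing trend",
    "Decreasing trend", "Upward shift", "Downward shift",
]
ROWS_PER_CLASS = 100


def tally_clusters_by_class(cluster_ids: Sequence[int],
                            rows_per_class: int = ROWS_PER_CLASS
                            ) -> Dict[str, Counter]:
    """Count cluster ids per known class.

    cluster_ids[i] is assumed to be the cluster assigned to input row i
    (0-based), with rows in the same order as the input file.
    """
    expected = rows_per_class * len(CLASSES)
    if len(cluster_ids) != expected:
        raise ValueError(f"expected {expected} cluster ids, got {len(cluster_ids)}")
    table = {name: Counter() for name in CLASSES}
    for row_index, cluster_id in enumerate(cluster_ids):
        table[CLASSES[row_index // rows_per_class]][cluster_id] += 1
    return table
```

A class whose rows fall almost entirely into one cluster was separated cleanly; a class spread across several clusters, or several classes sharing one cluster, suggests the algorithm's parameters need another look.
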
> =====================================
> https://mahout.apache.org/users/clustering/clustering-seinfeld-episodes.html



--
This message was sent by Atlassian JIRA
(v6.2#6252)
