[CONF] Apache Mahout > Cluster Dumper

confluence Fri, 03 Sep 2010 04:15:30 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Cluster Dumper 
(https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper)



Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h1. Introduction

Clustering tasks in Mahout write their output as a SequenceFile of 
(Text, Cluster) pairs, where the Text key is a cluster identifier string. To 
analyze this output, the sequence files must be converted to a human-readable 
format. The clusterdump utility does this.

h1. Steps for analyzing cluster output using clusterdump utility

After you've executed a clustering task (either an example or a real-world 
job), you can run ClusterDumper in one of two modes:
# [Hadoop Environment| #HadoopEnvironment]
# [Standalone Java Program | #StandaloneJavaProgram]

h3. Hadoop Environment {anchor:HadoopEnvironment}

If you have set up your HADOOP_HOME environment variable, you can use the 
command line utility "mahout" to execute the ClusterDumper on Hadoop. In this 
case there is no need to copy the output clusters to your local machine: the 
utility reads the output clusters from HDFS and writes the human-readable 
cluster values to the local file system. For example, if you've just executed 
the [synthetic control example |Clustering+of+synthetic+control+data] and want 
to analyze the output, you can execute
{code}$ $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10 
--pointsDir output/clusteredPoints --output 
$MAHOUT_HOME/examples/output/clusteranalyze.txt {code}

h3. Standalone Java Program {anchor:StandaloneJavaProgram}

ClusterDumper can also be run from the command line without Hadoop. If your 
HADOOP_HOME environment variable is not set, you can execute ClusterDumper 
locally using the "mahout" command line utility.
# Get the output data from Hadoop onto your local machine. For example, if 
you've executed a clustering example, use
{code} $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples {code}
This will create a folder called output inside $MAHOUT_HOME/examples, with 
sub-folders for each set of cluster outputs and for clusteredPoints.
# Run the clusterdump utility as follows:
$MAHOUT_HOME/bin/mahout clusterdump \--seqFileDir 
$MAHOUT_HOME/examples/output/clusters-10 \--pointsDir 
$MAHOUT_HOME/examples/output/clusteredPoints/ \--output 
$MAHOUT_HOME/examples/output/clusteranalyze.txt
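
In the examples on this page, the two modes differ only in where the input paths point: relative HDFS paths in the Hadoop case, local paths under $MAHOUT_HOME/examples/output in the standalone case. The following is a minimal sketch of that choice, not part of Mahout. The function name `clusterdump_cmd` is hypothetical, the paths are the ones from the synthetic control example, and the function only prints the command so it can be inspected before being run.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): prints the clusterdump command
# for the synthetic control example, choosing input paths by mode.
# Usage: clusterdump_cmd <mahout_home> <hadoop_home, or "" if unset>
# Assumes the paths contain no spaces.
clusterdump_cmd() {
    mahout_home=$1
    hadoop_home=$2
    out="$mahout_home/examples/output/clusteranalyze.txt"
    if [ -n "$hadoop_home" ]; then
        # Hadoop mode: inputs are read from HDFS, as in the first example.
        base="output"
    else
        # Standalone mode: inputs were copied locally with "hadoop fs -get".
        base="$mahout_home/examples/output"
    fi
    printf '%s\n' "$mahout_home/bin/mahout clusterdump --seqFileDir $base/clusters-10 --pointsDir $base/clusteredPoints --output $out"
}
```

Calling `clusterdump_cmd "$MAHOUT_HOME" "$HADOOP_HOME"` prints the command matching the current environment. In both modes the human-readable result is written to the local file system.
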
h5. Standalone Java Program through Eclipse
If you are using Eclipse, set up mahout-utils as a project as specified in 
[Working with Maven in Eclipse|BuildingMahout#mahout_maven_eclipse].
To execute ClusterDumper.java:

* Under mahout-utils, right-click on ClusterDumper.java
* Choose Run As, then Run Configurations
* On the left menu, click on Java Application
* On the top-bar click on "New Launch Configuration"
* A new launch should be automatically created with project as "mahout-utils" 
and Main Class as "org.apache.mahout.utils.clustering.ClusterDumper"
* In the Arguments tab, specify the arguments below, replacing <MAHOUT_HOME> 
with the actual path of your $MAHOUT_HOME:
\--seqFileDir <MAHOUT_HOME>/examples/output/clusters-10 \--pointsDir 
<MAHOUT_HOME>/examples/output/clusteredPoints \--output 
<MAHOUT_HOME>/examples/output/clusteranalyze.txt
* Click Run to execute the ClusterDumper using Eclipse.
Setting breakpoints and similar debugging steps should work as usual.

h3. Reading the output file

This will write the clusters to a file called clusteranalyze.txt inside 
$MAHOUT_HOME/examples/output.
Sample data will look like:
CL-0 \{ n=116 c=[29.922, 30.407, 30.373, 30.094, 29.886, 29.937, 29.751, 
30.054, 30.039, 30.126, 29.764, 29.835, 30.503, 29.876, 29.990, 29.605, 29.379, 
30.120, 29.882, 30.161, 29.825, 30.074, 30.001, 30.421, 29.867, 29.736, 29.760, 
30.192, 30.134, 30.082, 29.962, 29.512, 29.736, 29.594, 29.493, 29.761, 29.183, 
29.517, 29.273, 29.161, 29.215, 29.731, 29.154, 29.113, 29.348, 28.981, 29.543, 
29.192, 29.479, 29.406, 29.715, 29.344, 29.628, 29.074, 29.347, 29.812, 29.058, 
29.177, 29.063, 29.607] r=[3.463, 3.351, 3.452, 3.438, 3.371, 3.569, 3.253, 
3.531, 3.439, 3.472, 3.402, 3.459, 3.320, 3.260, 3.430, 3.452, 3.320, 3.499, 
3.302, 3.511, 3.520, 3.447, 3.516, 3.485, 3.345, 3.178, 3.492, 3.434, 3.619, 
3.483, 3.651, 3.833, 3.812, 3.433, 4.133, 3.855, 4.123, 3.999, 4.467, 4.731, 
4.539, 4.956, 4.644, 4.382, 4.277, 4.918, 4.784, 4.582, 4.915, 4.607, 4.672, 
4.577, 5.035, 5.241, 4.731, 4.688, 4.685, 4.657, 4.912, 4.300] \}
and so on, where CL-0 is cluster 0, n=116 is the number of points observed by 
this cluster, c=[29.922 ...] is the center of the cluster as a vector, and 
r=[3.463 ...] is the radius of the cluster as a vector.
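
For a quick overview of a large dump, the per-cluster lines can be summarized with standard tools. The sketch below is not part of Mahout. It assumes each cluster is written on a single line in the `CL-0 { n=116 c=[...] r=[...] }` shape of the sample, and it ignores every other line. The function name `summarize_clusters` is hypothetical. For each cluster it prints the identifier, the point count, and the number of dimensions in the center vector.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): summarizes clusterdump output.
# Reads the files given as arguments, or standard input if none are given.
# Assumes one cluster per line, shaped like: CL-0 { n=116 c=[...] r=[...] }
summarize_clusters() {
    awk '
    / n=[0-9]+ c=\[/ {
        id = $1                      # cluster identifier, e.g. CL-0

        n = $0                       # point count: the digits after "n="
        sub(/^.* n=/, "", n)
        sub(/ .*$/, "", n)

        c = $0                       # center: text between "c=[" and the first "]"
        sub(/^.* c=\[/, "", c)
        sub(/\].*$/, "", c)
        dims = split(c, parts, /, */)

        print id, n, dims
    }' "$@"
}
```

A typical invocation would be `summarize_clusters $MAHOUT_HOME/examples/output/clusteranalyze.txt`.
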
