Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: k-means-commandline 
(http://cwiki.apache.org/confluence/display/MAHOUT/k-means-commandline)


Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Introduction

This quick start page describes how to run the k-Means clustering algorithm on a 
Hadoop cluster. 

h1. Steps

Mahout's k-Means clustering can be launched from the same command line 
invocation whether you are running on a single machine in stand-alone mode or 
on a larger Hadoop cluster. The difference is determined by the $HADOOP_HOME 
and $HADOOP_CONF_DIR environment variables. If both point to a working Hadoop 
installation on the target machine, the invocation runs k-Means on that 
cluster. If either environment variable is missing, the stand-alone Hadoop 
configuration is used instead.

{code}
./bin/mahout kmeans <OPTIONS>
{code}
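
The mode selection described above can be sketched as a check on the two environment variables. This is only an illustration of the rule as stated on this page, not the logic of the bin/mahout script itself; the function name mahout_mode is hypothetical, and the sketch only tests that the variables are non-empty, not that they point to a working Hadoop installation.

```shell
#!/bin/sh
# Sketch only: reports which mode the rule described on this page would pick.
# mahout_mode is a hypothetical helper name, not part of Mahout.
# Assumption: a variable counts as "set" when it is non-empty.
mahout_mode() {
  if [ -n "${HADOOP_HOME:-}" ] && [ -n "${HADOOP_CONF_DIR:-}" ]; then
    echo "cluster"
  else
    echo "stand-alone"
  fi
}

# Print the mode for the current environment.
mahout_mode
```

Running this before ./bin/mahout kmeans gives a quick indication of where the job is likely to run.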

* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job will 
be generated in $MAHOUT_HOME/core/target/ and its name will contain the Mahout 
version number. For example, when using the Mahout 0.3 release, the job will be 
mahout-core-0.3.job


h2. Testing it on a single machine without a cluster

* Put the data: cp <PATH TO DATA> testdata
* Run the Job: 
{code}
./bin/mahout kmeans -i testdata -o output -c clusters -dm \
  org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
{code}

h2. Running it on the cluster

* (As needed) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job: 
{code}
export HADOOP_HOME=<Hadoop Home Directory>
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
./bin/mahout kmeans -i testdata -o output -c clusters -dm \
  org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cd 1 -k 25
{code}
* Get the data out of HDFS and have a look. Use $HADOOP_HOME/bin/hadoop fs -lsr 
output to list all outputs.
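
Both runs above use the same flags, so a small dry-run helper can make them easier to vary. The sketch below only prints the command line; the function name kmeans_cmd and the variable names (INPUT, OUTPUT, CLUSTERS, DISTANCE, MAX_ITER, DELTA, K) are hypothetical, while the flags and default values are the ones used in the examples on this page.

```shell
#!/bin/sh
# Hypothetical helper: prints the k-Means invocation shown on this page.
# Each value can be overridden through an illustrative shell variable;
# when a variable is unset, the value from the page's example is used.
kmeans_cmd() {
  cmd="./bin/mahout kmeans -i ${INPUT:-testdata} -o ${OUTPUT:-output}"
  cmd="$cmd -c ${CLUSTERS:-clusters}"
  cmd="$cmd -dm ${DISTANCE:-org.apache.mahout.common.distance.CosineDistanceMeasure}"
  cmd="$cmd -x ${MAX_ITER:-5} -ow -cd ${DELTA:-1} -k ${K:-25}"
  printf '%s\n' "$cmd"
}

# Dry run: show the command without executing it.
kmeans_cmd
```

From $MAHOUT_HOME, the printed line could then be executed with sh -c "$(kmeans_cmd)", assuming none of the values contain spaces.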

h1. Command line options
{code}
  --input (-i) input                           Path to job input directory.     
                                               Must be a SequenceFile of        
                                               VectorWritable                   
  --clusters (-c) clusters                     The input centroids, as Vectors. 
                                               Must be a SequenceFile of        
                                               Writable, Cluster/Canopy.  If k  
                                               is also specified, then a random 
                                               set of vectors will be selected  
                                               and written out to this path     
                                               first                            
  --output (-o) output                         The directory pathname for       
                                               output.                          
  --distanceMeasure (-dm) distanceMeasure      The classname of the             
                                               DistanceMeasure. Default is      
                                               SquaredEuclidean                 
  --convergenceDelta (-cd) convergenceDelta    The convergence delta value.     
                                               Default is 0.5                   
  --maxIter (-x) maxIter                       The maximum number of            
                                               iterations.                      
  --maxRed (-r) maxRed                         The number of reduce tasks.      
                                               Defaults to 2                    
  --k (-k) k                                   The k in k-Means.  If specified, 
                                               then a random selection of k     
                                               Vectors will be chosen as the    
                                               Centroid and written to the      
                                               clusters input path.             
  --overwrite (-ow)                            If present, overwrite the output 
                                               directory before running job     
  --help (-h)                                  Print out help                   
  --clustering (-cl)                           If present, run clustering after 
                                               the iterations have taken place  
{code}
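
As an illustration of the long-form options, the following dry-run sketch assembles an invocation that omits --k, so the centroids under the clusters path are assumed to exist already. The paths are the placeholders used elsewhere on this page, the convergence delta and reducer count repeat the defaults listed above, and the iteration count of 10 is an arbitrary illustrative value. The variable name kmeans_long_cmd is hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: collects the long-form options from the table above
# and prints the resulting command instead of executing it.
set -- \
  --input testdata \
  --clusters clusters \
  --output output \
  --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
  --convergenceDelta 0.5 \
  --maxIter 10 \
  --maxRed 2 \
  --overwrite \
  --clustering

# No --k is passed, so "clusters" must already hold the input centroids.
kmeans_long_cmd="./bin/mahout kmeans $*"
echo "$kmeans_long_cmd"
```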
