[CONF] Apache Mahout > ClusteringYourData

confluence Wed, 06 Apr 2011 00:22:27 -0700

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: ClusteringYourData 
(https://cwiki.apache.org/confluence/display/MAHOUT/ClusteringYourData)


Change Comment:
---------------------------------------------------------------------
Remove JSON reference

Edited by Sean Owen:
---------------------------------------------------------------------
+*Mahout_0.4*+

After you've done the [Quickstart] and are familiar with the basics of Mahout, 
it is time to cluster your own data. 

The following pieces *may* be useful in getting started:

h1. Input

For starters, you will need your data in an appropriate Vector format (the 
format has changed since Mahout 0.1).

* See [Creating Vectors]

h2. Text Preparation

* See [Creating Vectors from Text] 
* http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering

h1. Running the Process

h2. Canopy

Background: [canopy | Canopy Clustering]

Documentation of running canopy from the command line: [canopy-commandline]

h2. kMeans

Background: [K-Means Clustering]

Documentation of running kMeans from the command line: [k-means-commandline]

Documentation of running fuzzy kMeans from the command line: 
[fuzzy-k-means-commandline]

h2. Dirichlet

Background: [dirichlet | Dirichlet Process Clustering]

Documentation of running dirichlet from the command line: 
[dirichlet-commandline]

h2. Mean-shift

Background: [meanshift | Mean Shift Clustering]

Documentation of running mean shift from the command line: 
[mean-shift-commandline]

h2. Latent Dirichlet Allocation

Background and documentation: [LDA| Latent Dirichlet Allocation]

Documentation of running LDA from the command line: [lda-commandline]

h1. Retrieving the Output

Mahout has a cluster dumper utility that can be used to retrieve and evaluate 
your clustering data.
{code}
./bin/mahout clusterdump <OPTIONS>
{code}

h2. The cluster dumper options are:
{code}
  --help (-h)                              Print out help                       
  --seqFileDir (-s) seqFileDir             The directory containing Sequence    
                                           Files for the Clusters               
  --output (-o) output                     The output file.  If not specified,  
                                           dumps to the console                 
  --substring (-b) substring               The number of chars of the           
                                           asFormatString() to print            
  --pointsDir (-p) pointsDir               The directory containing points      
                                           sequence files mapping input vectors 
                                           to their cluster.  If specified,     
                                           then the program will output the     
                                           points associated with a cluster     
  --dictionary (-d) dictionary             The dictionary file.                 
  --dictionaryType (-dt) dictionaryType    The dictionary file type             
                                           (text|sequencefile)                  
  --numWords (-n) numWords                 The number of top terms to print     
{code}

More information on using the clusterdump utility can be found [here|Cluster Dumper].
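
As a rough illustration of how the options listed above combine, the following Python sketch assembles a clusterdump command line. The helper name {{clusterdump_command}} and all paths are hypothetical placeholders, not part of Mahout; only the flag names come from the option list above. The sketch prints the command rather than running it.

```python
import shlex


def clusterdump_command(seq_file_dir, points_dir=None, dictionary=None,
                        dictionary_type=None, num_words=None, output=None):
    """Build a clusterdump argument list from the documented options.

    This helper is hypothetical; it only strings together the flags
    shown in the option list and does not call Mahout itself.
    """
    cmd = ["./bin/mahout", "clusterdump", "--seqFileDir", seq_file_dir]
    optional = [
        ("--pointsDir", points_dir),
        ("--dictionary", dictionary),
        ("--dictionaryType", dictionary_type),
        ("--numWords", num_words),
        ("--output", output),
    ]
    for flag, value in optional:
        # Only emit flags the caller actually supplied.
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# All paths below are made-up examples; substitute your own job output.
cmd = clusterdump_command(
    "output/clusters-10",
    points_dir="output/clusteredPoints",
    dictionary="vectors/dictionary.file-0",
    dictionary_type="sequencefile",
    num_words=20,
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Leaving out {{output}} means the dump goes to the console, as described in the option list.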

h1. Validating the Output

From Ted Dunning's response at 
http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output:
{quote}
A principled approach to cluster evaluation is to measure how well the cluster 
membership captures the structure of unseen data.  A natural measure for this 
is to measure how much of the entropy of the data is captured by cluster 
membership.  For k-means and its natural L_2 metric, the natural cluster 
quality metric is the squared distance from the nearest centroid adjusted by 
the log_2 of the number of clusters.  This can be compared to the squared 
magnitude of the original data or the squared deviation from the centroid for 
all of the data.  The idea is that you are changing the representation of the 
data by allocating some of the bits in your original representation to 
represent which cluster each point is in.  If those bits aren't made up by the 
residue being small then your clustering is making a bad trade-off.

In the past, I have used other more heuristic measures as well.  One of the key 
characteristics that I would like to see out of a clustering is a degree of 
stability.  Thus, I look at the fractions of points that are assigned to each 
cluster or the distribution of distances from the cluster centroid. These 
values should be relatively stable when applied to held-out data.

For text, you can actually compute perplexity which measures how well cluster 
membership predicts what words are used.  This is nice because you don't have 
to worry about the entropy of real valued numbers.

Manual inspection and the so-called laugh test is also important.  The idea is 
that the results should not be so ludicrous as to make you laugh. 
Unfortunately, it is pretty easy to kid yourself into thinking your system is 
working using this kind of inspection.  The problem is that we are too good at 
seeing (making up) patterns.
{quote}
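
The quantities named in the first paragraph of the quote can be sketched in a few lines of Python. This is a minimal illustration, not Mahout code: the function names are invented here, the points and centroids are toy values, and because the quote does not spell out how the log_2 adjustment is combined with the squared distance, the sketch reports the three quantities side by side instead of merging them into one score.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean (L_2) distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def clustering_summary(held_out, centroids):
    """Summarise how well the centroids describe held-out points.

    residual:       mean squared distance to the nearest centroid
    baseline:       mean squared deviation from the overall mean
    bits_per_point: log_2 of the number of clusters, i.e. the cost of
                    recording which cluster each point belongs to
    """
    n = len(held_out)
    residual = sum(
        min(squared_distance(p, c) for c in centroids) for p in held_out
    ) / n
    center = mean_vector(held_out)
    baseline = sum(squared_distance(p, center) for p in held_out) / n
    return {
        "residual": residual,
        "baseline": baseline,
        "bits_per_point": math.log2(len(centroids)),
    }


# Toy held-out points and centroids, chosen only for illustration.
held_out = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids = [[0.0, 1.0], [10.0, 1.0]]
summary = clustering_summary(held_out, centroids)
print(summary)
```

A residual that is small relative to the baseline suggests the bits spent on cluster membership are being repaid; a residual close to the baseline suggests they are not.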

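The stability check from the second paragraph of the quote can be approximated by comparing the fraction of points assigned to each cluster on training data and on held-out data. The sketch below is one possible reading under stated assumptions: nearest-centroid assignment with squared Euclidean distance, invented helper names, and toy data.

```python
from collections import Counter


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (squared L_2 distance)."""
    def dist(i):
        return sum((x - y) ** 2 for x, y in zip(point, centroids[i]))
    return min(range(len(centroids)), key=dist)


def cluster_fractions(points, centroids):
    """Fraction of points assigned to each cluster, in centroid order."""
    counts = Counter(nearest_centroid(p, centroids) for p in points)
    return [counts[i] / len(points) for i in range(len(centroids))]


def max_fraction_shift(train, held_out, centroids):
    """Largest per-cluster change in assignment fraction between the sets."""
    train_fracs = cluster_fractions(train, centroids)
    held_fracs = cluster_fractions(held_out, centroids)
    return max(abs(a - b) for a, b in zip(train_fracs, held_fracs))


# Toy one-dimensional data, chosen only for illustration.
centroids = [[0.5], [9.5]]
train = [[0.0], [1.0], [9.0], [10.0]]
held_out = [[0.5], [9.5], [10.5], [11.0]]
shift = max_fraction_shift(train, held_out, centroids)
print(cluster_fractions(train, centroids), cluster_fractions(held_out, centroids), shift)
```

What counts as "relatively stable" is a judgment call; the quote gives no threshold, so none is assumed here.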

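For the text case mentioned in the quote, perplexity can be computed from per-cluster word distributions. The sketch below uses a simple unigram model per cluster with add-alpha smoothing; that model choice, the function names, and the toy documents are assumptions made for illustration and are not taken from the quote or from Mahout. Held-out words are assumed to appear in the vocabulary.

```python
import math
from collections import Counter


def cluster_word_probs(docs, assignments, num_clusters, vocabulary, alpha=1.0):
    """Estimate a smoothed unigram word distribution for each cluster."""
    counts = [Counter() for _ in range(num_clusters)]
    for doc, cluster in zip(docs, assignments):
        counts[cluster].update(doc)
    probs = []
    for cluster in range(num_clusters):
        total = sum(counts[cluster].values()) + alpha * len(vocabulary)
        probs.append(
            {w: (counts[cluster][w] + alpha) / total for w in vocabulary}
        )
    return probs


def perplexity(docs, assignments, probs):
    """2 ** (average negative log_2 probability per held-out word)."""
    log_sum = 0.0
    num_words = 0
    for doc, cluster in zip(docs, assignments):
        for word in doc:
            log_sum += math.log2(probs[cluster][word])
            num_words += 1
    return 2 ** (-log_sum / num_words)


# Toy tokenised documents and cluster assignments, for illustration only.
vocabulary = ["apple", "pear", "disk", "cpu"]
train_docs = [["apple", "pear", "apple"], ["disk", "cpu", "cpu"]]
train_assignments = [0, 1]
probs = cluster_word_probs(train_docs, train_assignments, 2, vocabulary)

held_out_docs = [["pear", "apple"], ["cpu", "disk"]]
held_out_assignments = [0, 1]
print(perplexity(held_out_docs, held_out_assignments, probs))
```

Lower perplexity means cluster membership predicts the words better; a model that knows nothing beyond the vocabulary scores a perplexity equal to the vocabulary size.
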
h1. References

* [Mahout archive 
references|http://www.lucidimagination.com/search/p:mahout?q=clustering]
