[CONF] Apache Mahout > Top Down Clustering

confluence Wed, 07 Dec 2011 03:06:28 -0800

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Top Down Clustering 
(https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering)



Edited by Paritosh Ranjan:
---------------------------------------------------------------------
Top Down Clustering

Top Down clustering is a type of hierarchical clustering. It finds bigger 
clusters first and then performs fine-grained clustering on each of them, 
hence the name Top Down.
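
The two-level idea can be illustrated with a minimal plain-Java sketch. It is not Mahout code: the class `TopDownSketch`, its methods, and the gap thresholds are hypothetical, and the gap-based grouping of sorted 1-D values is only a toy stand-in for whatever clustering algorithm is actually used at each level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: not part of Mahout. All names here are hypothetical.
class TopDownSketch {

    // Toy stand-in for a clustering algorithm: sorts the values and starts a
    // new cluster whenever the gap to the previous value exceeds maxGap.
    static List<List<Double>> clusterByGap(List<Double> values, double maxGap) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<List<Double>> clusters = new ArrayList<>();
        List<Double> current = null;
        for (double v : sorted) {
            if (current == null || v - current.get(current.size() - 1) > maxGap) {
                current = new ArrayList<>();
                clusters.add(current);
            }
            current.add(v);
        }
        return clusters;
    }

    // Top down: find coarse clusters first, then cluster finely inside each one.
    static List<List<List<Double>>> topDown(List<Double> values,
                                            double coarseGap, double fineGap) {
        List<List<List<Double>>> hierarchy = new ArrayList<>();
        for (List<Double> topLevelCluster : clusterByGap(values, coarseGap)) {
            hierarchy.add(clusterByGap(topLevelCluster, fineGap));
        }
        return hierarchy;
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(1.0, 1.2, 3.0, 3.1, 20.0, 20.5, 24.0);
        // prints [[[1.0, 1.2], [3.0, 3.1]], [[20.0, 20.5], [24.0]]]
        System.out.println(topDown(values, 5.0, 1.0));
    }
}
```

The loose threshold plays the role of clustering parameters that produce bigger clusters, and the tighter one plays the role of the fine-grained bottom level pass.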

In principle, any clustering algorithm can perform the Top Level Clustering 
(finding bigger clusters) and the Bottom Level Clustering (fine-grained 
clustering on each of the top level clusters). All clustering algorithms 
available in Mahout, other than the MinHash Clustering algorithm (which is a 
"Bottom Up" clustering algorithm), are therefore suitable for Top Down 
Clustering at both the top level and the bottom level.

The top level clustering output needs to be post-processed to identify all 
top level clusters and to group vectors into their respective top level 
clusters, so that the bottom level clustering can then be executed on each 
of them.
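
As a rough picture of what this grouping accomplishes, here is a minimal plain-Java sketch. It is not Mahout's implementation and does not read Mahout's output files. It assumes the top level output has already been reduced to one cluster id per vector, and the names `GroupByClusterSketch` and `groupByCluster` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: not part of Mahout. All names here are hypothetical.
class GroupByClusterSketch {

    // Groups vectors by the id of the top level cluster they were assigned to.
    // clusterIds.get(i) is the cluster id assigned to vectors.get(i).
    static Map<Integer, List<double[]>> groupByCluster(List<Integer> clusterIds,
                                                       List<double[]> vectors) {
        if (clusterIds.size() != vectors.size()) {
            throw new IllegalArgumentException("one cluster id is required per vector");
        }
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (int i = 0; i < vectors.size(); i++) {
            groups.computeIfAbsent(clusterIds.get(i), id -> new ArrayList<>())
                  .add(vectors.get(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(7, 3, 7);
        List<double[]> vectors = Arrays.asList(
                new double[] {1.0, 1.0},
                new double[] {9.0, 9.0},
                new double[] {1.5, 0.5});
        Map<Integer, List<double[]>> groups = groupByCluster(ids, vectors);
        for (Map.Entry<Integer, List<double[]>> entry : groups.entrySet()) {
            System.out.println("cluster " + entry.getKey() + ": "
                    + entry.getValue().size() + " vectors");
        }
    }
}
```

Each entry of the resulting map corresponds to one top level cluster, which is the unit the bottom level clustering would then be run on.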

The first step in top down clustering is to run a clustering algorithm of 
your choice, preferably with clustering parameters that produce bigger 
clusters. This is the top level clustering.

Next, post-process the output of this clustering to group the vectors into 
their respective top level clusters. This can be done using 
*ClusterOutputPostProcessorDriver*.

*ClusterOutputPostProcessorDriver* has a run method:

*run(Path input, Path output, boolean runSequential)*

The input parameter of the run method is _"the output path provided to the 
clustering algorithm"_, whose contents will be post-processed. It is the path 
of the directory containing clusters-*-final and clusteredPoints.

The output parameter of the run method is _"the path where the post-processed 
data will be stored"_.

The runSequential parameter of the run method is a flag: _"if set to true, the 
post-processing runs sequentially; otherwise it uses MapReduce"_. Hint: use 
the same mode as the clustering step. If the clustering was done sequentially, 
set it to true; if it was done with MapReduce, set it to false.
