Re: Canopy and other clustering approaches

Paritosh Ranjan Tue, 01 Nov 2011 20:46:56 -0700

Hi Grant,

I have been working on Top Down Clustering: https://issues.apache.org/jira/browse/MAHOUT-843

In this approach, the top-level clustering algorithm (e.g. Canopy) can run with large t1 and t2 values. Any other clustering algorithm (selected by the user) is then executed on the clusters produced by the top-level clustering.
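
To make the two-level idea concrete, here is a minimal sketch in plain Python. It is not the MAHOUT-843 patch and does not use Mahout's API: the function names (canopies, kmeans, top_down), the Euclidean distance, and the naive k-means initialisation are all assumptions chosen for illustration. Canopy runs first with loose thresholds, then k-means runs separately inside each canopy.

```python
import math

def canopies(points, t1, t2):
    """Coarse top-level pass. Returns (center, members) pairs; assumes t1 > t2 >= 0."""
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        # Loose threshold t1 decides membership, so canopies may overlap.
        members = [p for p in points if math.dist(p, center) <= t1]
        result.append((center, members))
        # Tight threshold t2 removes points from the list of candidate centers.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return result

def kmeans(points, k, iterations=10):
    """Bottom-level pass: plain Lloyd iterations, seeded with the first k points."""
    centroids = [points[i] for i in range(min(k, len(points)))]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        centroids = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

def top_down(points, t1, t2, k):
    """Run the fine-grained algorithm independently inside each coarse cluster."""
    return [(center, kmeans(members, k))
            for center, members in canopies(points, t1, t2)]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for center, sub_centroids in top_down(points, t1=5.0, t2=3.0, k=2):
    print(center, sub_centroids)
```

With these sample points the top level finds two canopies, and k-means then produces two sub-centroids inside each one. Any other bottom-level algorithm could be swapped in for kmeans here.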

I have been able to configure top-level and bottom-level clustering with some of the clustering algorithms available.

I will be submitting the patch sometime this week. Using it, we will be able to run Canopy clustering (or another clustering algorithm) first to extract bigger clusters, and then apply other fine-grained clustering algorithms to the extracted clusters.


I think this will help in achieving what is needed.

Thanks and Regards,
Paritosh

On 02-11-2011 09:01, Grant Ingersoll wrote:

In reviewing clustering for upcoming training, I'm wondering about something w/ 
Canopy clustering that we claim, but wanted to check here first.  In the 
lectures, etc. I've seen on it, the idea is to run Canopy first and then some 
other more expensive algorithm, such as k-means, etc. with the idea that items 
further away than T2 are not even considered when scoring a centroid in the 
more complex clustering approach.  However, I think I'm missing where in the 
code this actually happens.  We do have code that allows k-means to use the 
Canopy centroids as its initial centroids, but the other material 
seemed to imply more aggressive pruning was possible since points outside of T2 
would not even need to be considered.  Otherwise, it doesn't seem like we are 
saving anything by doing Canopy first other than we likely have a better set of 
starting centroids.  I haven't thought about how this would be implemented.
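
One way the pruning could be implemented, offered only as a rough sketch and not as anything that exists in the Mahout code: in the k-means assignment step, compare a point only against centroids that share at least one canopy with it. The names canopy_ids and pruned_assign are hypothetical, canopy membership is taken as "within the loose threshold T1 of a canopy center" (an assumption following the usual description of canopy clustering), and the same Euclidean metric is used for both the canopy test and the real distance, whereas the technique normally relies on a cheaper metric for the canopy test.

```python
import math

def canopy_ids(item, canopy_centers, t1):
    """Indices of the canopies whose center lies within t1 of the item."""
    return {i for i, c in enumerate(canopy_centers) if math.dist(item, c) <= t1}

def pruned_assign(points, centroids, canopy_centers, t1):
    """Assign each point to its nearest centroid, skipping centroids in other canopies.

    Returns (assignments, comparisons), where comparisons counts the
    point-to-centroid distances that were actually evaluated.
    """
    centroid_canopies = [canopy_ids(c, canopy_centers, t1) for c in centroids]
    assignments = []
    comparisons = 0
    for p in points:
        p_canopies = canopy_ids(p, canopy_centers, t1)
        best, best_d = None, math.inf
        for j, c in enumerate(centroids):
            if p_canopies.isdisjoint(centroid_canopies[j]):
                continue  # no shared canopy: the expensive distance is never computed
            comparisons += 1
            d = math.dist(p, c)
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)  # stays None if no centroid shares a canopy
    return assignments, comparisons

points = [(0, 0), (1, 0), (10, 10), (11, 10)]
centroids = [(0.5, 0), (10.5, 10)]
canopy_centers = [(0, 0), (10, 10)]
print(pruned_assign(points, centroids, canopy_centers, t1=3.0))
```

On this toy input each point is compared with one centroid, not both, so 4 distances are evaluated where an unpruned pass would evaluate 8. The saving only pays off when the canopy membership test is much cheaper than the full distance.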

Then again, it's late and I'm tired.

-Grant

