Hi Grant,
I have been working on Top Down Clustering.
https://issues.apache.org/jira/browse/MAHOUT-843
Here, the top-level clustering algorithm (e.g. Canopy) runs with large
t1 and t2 values, and then any other clustering algorithm (selected by
the user) is executed on each of the clusters the top level produces.
I have been able to configure top-level and bottom-level clustering with
some of the clustering algorithms available.
I will be submitting the patch sometime this week. With it, we will
be able to run Canopy clustering (or another clustering algorithm)
first to extract larger clusters, and then apply finer-grained
clustering algorithms to the extracted clusters.
I think this will help achieve what is needed.
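
To make the two-level idea concrete, here is a minimal Python sketch. It is
not the MAHOUT-843 patch and not Mahout's API: the function names
(canopies, kmeans, top_down), the Euclidean distance, and the naive
"first k points" k-means seeding are all assumptions chosen for
illustration.

```python
import math


def canopies(points, t1, t2):
    """Coarse top-level pass: return a list of (center, members) pairs.

    Members are all points within the loose threshold t1 of the center;
    points within the tight threshold t2 can no longer become centers.
    Canopies may overlap when t1 is large relative to the data spread.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        members = [p for p in points if math.dist(center, p) <= t1]
        result.append((center, members))
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return result


def kmeans(points, k, iterations=10):
    """Fine-grained bottom-level pass (hypothetical stand-in for any
    user-selected algorithm). Seeds with the first k points."""
    k = min(k, len(points))
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if the cluster ended up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return [c for c in clusters if c]


def top_down(points, t1, t2, k):
    """Run the coarse pass, then the fine pass inside each coarse cluster."""
    return [(center, kmeans(members, k))
            for center, members in canopies(points, t1, t2)]


points = [(0, 0), (10, 0), (0, 1), (10, 1),
          (100, 100), (110, 100), (100, 101), (110, 101)]
for center, sub_clusters in top_down(points, t1=30, t2=20, k=2):
    print(center, sub_clusters)
```

With big t1 and t2 the top level only separates the two far-apart groups, and
the bottom-level algorithm then splits each group on its own.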
Thanks and Regards,
Paritosh
On 02-11-2011 09:01, Grant Ingersoll wrote:
In reviewing clustering for upcoming training, I'm wondering about something we
claim about Canopy clustering, and wanted to check here first. In the
lectures, etc. I've seen on it, the idea is to run Canopy first and then some
other more expensive algorithm, such as k-means, etc. with the idea that items
further away than T2 are not even considered when scoring a centroid in the
more complex clustering approach. However, I think I'm missing where in the
code this actually happens. We do have code that allows K-Means to use the
Canopy centroids as its initial centroids, but the other material
seemed to imply more aggressive pruning was possible since points outside of T2
would not even need to be considered. Otherwise, it doesn't seem like we are
saving anything by doing Canopy first other than we likely have a better set of
starting centroids. I haven't thought about how this would be implemented.
Then again, it's late and I'm tired.
-Grant
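
The pruning described above could look roughly like the following Python
sketch. It is an illustration only, not code from Mahout: the names
(canopy_centers, pruned_kmeans, prune_radius), the Euclidean distance, and
the choice to seed one centroid per canopy are assumptions. prune_radius
defaults to T2 as in the description above; a looser radius such as T1 could
be passed instead.

```python
import math


def canopy_centers(points, t2):
    """Pick canopy centers: each new center removes every point within t2
    of it from the list of candidates for future centers."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return centers


def pruned_kmeans(points, t2, prune_radius=None, iterations=5):
    """k-means seeded with canopy centers, where each point is only scored
    against centroids whose canopy center lies within prune_radius of it.

    Returns (centroids, assignment, scored), where scored counts the
    point-to-centroid distance evaluations inside the k-means loop.
    """
    if prune_radius is None:
        prune_radius = t2
    if prune_radius < t2:
        raise ValueError("prune_radius must be at least t2")
    centers = canopy_centers(points, t2)
    # Computed once up front. Here it uses the same distance as k-means;
    # in practice a cheaper approximate distance would be the point.
    candidates = [
        [j for j, c in enumerate(centers) if math.dist(p, c) <= prune_radius]
        for p in points
    ]
    centroids = list(centers)
    assignment = [None] * len(points)
    scored = 0
    for _ in range(iterations):
        for i, p in enumerate(points):
            scored += len(candidates[i])
            assignment[i] = min(candidates[i],
                                key=lambda j: math.dist(p, centroids[j]))
        for j in range(len(centroids)):
            members = [points[i] for i, a in enumerate(assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, assignment, scored


points = [(0, 0), (10, 0), (0, 1), (10, 1),
          (100, 100), (110, 100), (100, 101), (110, 101)]
centroids, assignment, scored = pruned_kmeans(points, t2=20)
unpruned = len(points) * len(centroids) * 5
print(centroids, assignment, scored, unpruned)
```

The saving comes from the candidate lists: a point far from a canopy never has
its distance to that canopy's centroid computed during the iterations, which
is more than just starting from better centroids.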