In reviewing clustering for upcoming training, I'm wondering about something we claim about Canopy clustering, and wanted to check here first. In the lectures and other material I've seen on it, the idea is to run Canopy first and then a more expensive algorithm, such as k-means, so that items further away than T2 are not even considered when scoring a centroid in the more expensive algorithm. However, I think I'm missing where in the code this actually happens. We do have code that lets k-means use the Canopy centroids as its initial centroids, but the other material seemed to imply that more aggressive pruning is possible, since points outside of T2 would not even need to be considered. Otherwise, it doesn't seem like we are saving anything by running Canopy first, other than likely having a better set of starting centroids. I haven't thought about how this would be implemented.
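To make the question concrete, here is a minimal sketch in Python of the kind of pruning I mean. It is not our code and not a proposal for how it should be implemented; every name in it (`build_canopies`, `assign_pruned`, `t1`, `t2`) is hypothetical. It assumes the convention, as I read the original canopy material, that T1 is the loose threshold deciding canopy membership and T2 (< T1) is the tight threshold that only removes points from the list of candidate centers, so under that reading the pruning boundary would be T1 rather than T2. It also assumes centroid k is seeded from canopy k and stays associated with it.

```python
import math


def build_canopies(points, t1, t2):
    """Build overlapping canopies; t1 (loose) must exceed t2 (tight).

    Returns (centers, canopies), where canopies[k] is the set of point
    indices within t1 of centers[k].
    """
    assert t1 > t2 >= 0
    remaining = list(range(len(points)))
    centers, canopies = [], []
    while remaining:
        center = points[remaining[0]]
        # Membership uses the loose threshold, so canopies may overlap.
        members = {i for i, p in enumerate(points) if math.dist(p, center) <= t1}
        centers.append(center)
        canopies.append(members)
        # Points within the tight threshold can no longer start a canopy.
        remaining = [i for i in remaining if math.dist(points[i], center) > t2]
    return centers, canopies


def assign_pruned(points, centroids, canopies):
    """One k-means assignment step that only scores centroids whose
    canopy contains the point. Returns (assignment, distance_count)."""
    assignment, n_dist = [], 0
    for i, p in enumerate(points):
        # Every point lies in at least one canopy, so this is never empty.
        candidates = [k for k, members in enumerate(canopies) if i in members]
        n_dist += len(candidates)
        assignment.append(min(candidates, key=lambda k: math.dist(p, centroids[k])))
    return assignment, n_dist


# Two well-separated groups of three points each (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, canopies = build_canopies(points, t1=3.0, t2=1.5)
assignment, n_dist = assign_pruned(points, centers, canopies)
print(assignment, n_dist, "vs unpruned", len(points) * len(centers))
```

On this toy data the pruned step computes 6 point-to-centroid distances instead of the 12 an unpruned step would. That saving is what I can't find in our k-means code, as opposed to just getting better starting centroids.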
Then again, it's late and I'm tired. -Grant
