Andy, would you like to review the final version of the clustering code before it goes in [1]?

[1] https://reviews.apache.org/r/10194/
Ted, it's pretty much done. Okay it and I'll commit.

On Wed, May 8, 2013 at 11:57 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Wed, May 8, 2013 at 10:28 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
>
> > > > I think it avoids the need of the special way we handle the increase of
> > > > distanceCutoff by beta in another if.
> > >
> > > Sure. Sounds right and all.
> > >
> > > But experiment will tell better.
> >
> > yes.
>
> But I definitely saw cases where the same cutoff caused the centroid count
> to decrease. In my mind, continuing to increase the cutoff in those cases
> is a bad thing. A smaller cutoff is more conservative in that it will
> preserve more data in the sketch. Until we see it preserving too much
> data, we don't need to increase the cutoff.

I kept the overshoot just to be safe in the CL.

> > > ... They
> > > > actually call it a "facility cost" rather than a distance, probably for
> > > > this reason.
>
> Btw... the reason that they call it a facility cost is because they are
> referring to a different literature. With k-means, k is traditionally
> fixed. With facility assignment, it is traditionally not. The problems
> are otherwise quite similar. The reason for the difference in nomenclature
> is because the facility assignment stuff comes from operations research,
> not computer science.

Ah, well that explains it. :)

> > ... I'm uncomfortable with the distanceCutoff growing too high, but I'll just
> > put the blame on that one on the data.
>
> I am uncomfortable as well.
>
> This is one reason I would like to only increase the distanceCutoff when a
> small value proves ineffective.

Alright, this is the version that's going in.

> > StreamingKMeans + BallKMeans gave good results compared to Mahout KMeans on
> > other data sets (similar kinds of clusters and good looking Dunn and
> > Davies-Bouldin indices).
>
> You hide this gem in a long email!!!
>
> Good news.

Yeah. :) It's comparable to Mahout KMeans quality wise, and very tweakable. The speed improvements should be apparent on large data sets that we run on Hadoop.

> > > > The estimate we give it at the beginning is only valid as long as not
> > > > enough datapoints have been processed to go over k log n.
> > >
> > > Are we talking about clusterOvershoot here? Or the numClusters over-ride?
> >
> > We collapse the clusters when the number of actual centroids is over
> > clusterOvershoot * numClusters.
> > I'm thinking that since numClusters increases anyway, clusterOvershoot
> > means we end up with more clusters than we need (not bad per se, but trying
> > to get rid of variables).
>
> I view it as numClusters is the minimum number of clusters that we want to
> see. ClusterOverShoot says that we can go a ways above the minimum, but we
> hopefully will just collapse back down to the minimum or above.
>
> > > Well, we have seen cases where the over-shoot needed to be >1. Those may
> > > have gone away with better adaptation, but I think that they probably still
> > > can happen.
> >
> > Sorry, what do you mean by adaptation here?
>
> Better adjustment and use of the distanceCutoff. This should make the
> collapse in the recursive clustering be less dramatic and more predictable.
> That will make the system require less over-shoot.
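
For anyone following along, here is a rough sketch of the rule discussed in this thread: collapse once the centroid count exceeds clusterOvershoot * numClusters, and only grow distanceCutoff by beta when the collapse at the current cutoff fails to shrink the sketch. It is not the code from the review. The class name, method names, the 1-D {mean, weight} centroid representation, and the deterministic merge (instead of a probabilistic assignment) are all assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Mahout StreamingKMeans implementation.
public class CutoffSketch {
    // Current merge radius; grown by beta only when a collapse proves ineffective.
    static double distanceCutoff = 1.0;

    // One greedy pass over 1-D centroids stored as {mean, weight}:
    // merge each into the nearest kept centroid if it lies within the cutoff,
    // otherwise keep it as a new centroid. (Assumed deterministic for clarity.)
    static List<double[]> collapse(List<double[]> centroids, double cutoff) {
        List<double[]> kept = new ArrayList<>();
        for (double[] c : centroids) {
            double[] nearest = null;
            double best = Double.MAX_VALUE;
            for (double[] k : kept) {
                double d = Math.abs(k[0] - c[0]);
                if (d < best) {
                    best = d;
                    nearest = k;
                }
            }
            if (nearest != null && best <= cutoff) {
                // Weighted mean update of the centroid that absorbs c.
                double w = nearest[1] + c[1];
                nearest[0] = (nearest[0] * nearest[1] + c[0] * c[1]) / w;
                nearest[1] = w;
            } else {
                kept.add(new double[] {c[0], c[1]});
            }
        }
        return kept;
    }

    // Collapse only when over clusterOvershoot * numClusters centroids.
    // Leave the cutoff alone if the collapse reduced the count; otherwise
    // multiply it by beta so the next collapse is more aggressive.
    static List<double[]> maybeCollapse(List<double[]> centroids, int numClusters,
                                        double clusterOvershoot, double beta) {
        if (centroids.size() <= clusterOvershoot * numClusters) {
            return centroids;
        }
        List<double[]> collapsed = collapse(centroids, distanceCutoff);
        if (collapsed.size() >= centroids.size()) {
            distanceCutoff *= beta;
        }
        return collapsed;
    }
}
```

The point of the second branch is the conservative behaviour argued for above: a small cutoff that still shrinks the sketch is kept as is, so more of the data is preserved until the cutoff is shown to be too small.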