On Fri, Apr 6, 2012 at 12:48 PM, Federico Castanedo <[email protected]> wrote:
> ...
> The only difference I notice between Kmeans and StreamingKmeans class
> is the dynamic increment of maxClusters and the distanceCutoff test.
> So, I executed the KMeans class against a subset of the BigCross
> dataset and it works fine.

Yes. StreamingKmeans is the updated version of Kmeans, which was the fast
prototype.

The dynamic increment of maxClusters was intended to play to the strength
of this kind of 1-pass algorithm, which is to find a lot of clusters that
characterize the data well. The desired application is in nearest-neighbor
search, where we want to see thousands to hundreds of thousands of
clusters, but I wanted to make sure that users didn't accidentally get bad
results by specifying that they wanted 10 clusters from a million data
points.

Note that a side effect of this choice is that you really need much more
than 100 data points for this algorithm to make sense. The intended
application is datasets in the millions of rows or more.

There is also a difference in that StreamingKmeans uses polymorphism to
allow the clustering of points to share code with the clustering of
centroids.

The growth of the distanceCutoff is also constrained in StreamingKmeans to
counter a tendency for the basic algorithm to grow it larger than desired.
Unbounded growth may be theoretically desirable, but it is practically
very undesirable.

> What is the rationale behind choosing f = estimateCutoff() instead of
> f = 1/(k(1+log n)) of the original algorithm?

I think that the original algorithm was assuming that the data was
pre-scaled to unit scale. I don't like that, since it requires an entire
extra pass over the data. To deal with that, I use estimateCutoff to
estimate a distance which is "small" in terms of the data distribution.
This distance is then rapidly increased as the limit on the max number of
centroids is hit repeatedly with the early data.

> On the other hand, the estimateCutoff() could be biased to the minimum
> distance of the first 100 points in the stream, or perhaps the idea is
> to update also this value online...?

This is a lot like what happens. I scan the first several data points and
take the distance between the nearest pair as a scale. From then on, the
value is scaled up to make it more and more rare to need to collapse the
centroids.
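In case a concrete outline helps, here is a minimal Java sketch of the idea
described above: estimate a small data-derived cutoff, create a new centroid
when a point is improbably far from the existing ones, and grow the cutoff
whenever the centroid budget is exceeded. This is not the Mahout
StreamingKMeans code; the class name, the assign() helper, and the 1.5x
growth factor are purely illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Toy sketch of the 1-pass streaming clustering idea discussed above.
 * Not the Mahout implementation: names and the growth factor are
 * illustrative only.
 */
public class StreamingSketch {
  private final int maxClusters;
  private double distanceCutoff;                 // the "f" in the discussion
  private final List<double[]> centroids = new ArrayList<>();
  private final List<Integer> weights = new ArrayList<>();
  private final Random rand = new Random();

  public StreamingSketch(int maxClusters, double initialCutoff) {
    this.maxClusters = maxClusters;
    this.distanceCutoff = initialCutoff;
  }

  // Rough analogue of estimateCutoff(): use the smallest pairwise distance
  // among an initial sample of points as a "small" scale for the data.
  public static double estimateCutoff(List<double[]> sample) {
    double min = Double.POSITIVE_INFINITY;
    for (int i = 0; i < sample.size(); i++) {
      for (int j = i + 1; j < sample.size(); j++) {
        min = Math.min(min, distance(sample.get(i), sample.get(j)));
      }
    }
    return min > 0 ? min : 1.0e-9;               // guard against duplicate points
  }

  public void add(double[] point) {
    assign(point);
    // When the centroid budget is exceeded, grow the cutoff (constrained
    // growth rather than an unbounded schedule) and re-cluster the
    // centroids with the same assignment code used for points.
    if (centroids.size() > maxClusters) {
      distanceCutoff *= 1.5;
      List<double[]> old = new ArrayList<>(centroids);
      centroids.clear();
      weights.clear();
      for (double[] c : old) {
        assign(c);                               // centroid weights ignored for brevity
      }
    }
  }

  private void assign(double[] point) {
    if (centroids.isEmpty()) {
      centroids.add(point.clone());
      weights.add(1);
      return;
    }
    // Find the nearest existing centroid.
    int best = 0;
    double bestDist = distance(point, centroids.get(0));
    for (int i = 1; i < centroids.size(); i++) {
      double d = distance(point, centroids.get(i));
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    // Start a new cluster with probability proportional to how far the
    // point is relative to the cutoff; otherwise fold it into the nearest
    // centroid as a running mean.
    if (rand.nextDouble() < Math.min(1.0, bestDist / distanceCutoff)) {
      centroids.add(point.clone());
      weights.add(1);
    } else {
      double[] c = centroids.get(best);
      int w = weights.get(best);
      for (int d = 0; d < c.length; d++) {
        c[d] = (c[d] * w + point[d]) / (w + 1);
      }
      weights.set(best, w + 1);
    }
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

The point of the sketch is only that the cutoff starts from a data-derived
"small" distance and grows in controlled steps each time the maxClusters
budget is hit, rather than following the 1/(k(1+log n)) schedule of the
original paper, and that the same assignment code is reused when the
centroids themselves are collapsed.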
