On Fri, Apr 6, 2012 at 12:48 PM, Federico Castanedo <[email protected]> wrote:
> ...
> The only difference I notice between Kmeans and StreamingKmeans class
> is the dynamic increment of maxClusters and the distanceCutoff test.
> So, I executed the KMeans class against a subset of the BigCross
> dataset and it works fine.

Yes. StreamingKmeans is the updated version of Kmeans, which was the fast
prototype.

The dynamic increment of maxClusters was intended to play to the strength
of this kind of 1-pass algorithm, which is to find a lot of clusters that
characterize the data well. The desired application is in nearest-neighbor
search, where we want to see thousands to hundreds of thousands of
clusters, but I wanted to make sure that users didn't accidentally get bad
results by specifying that they wanted 10 clusters from a million data
points.

Note that a side effect of this choice is that you really need much more
than 100 data points for this algorithm to make sense. The intended
application is datasets in the millions of rows or more.

There is also a difference in that StreamingKmeans uses polymorphism to
allow the clustering of points to share code with the clustering of
centroids.

The growth of the distanceCutoff is also constrained in StreamingKmeans to
counter a tendency for the basic algorithm to grow it larger than desired.
Unbounded growth may be theoretically desirable, but it is practically
very undesirable.

> What is the rationale behind choosing f = estimateCutoff() instead of
> f = 1/(k(1+log n)) of the original algorithm?

I think that the original algorithm was assuming that the data was
pre-scaled to unit scale. I don't like that, since it requires an entire
extra pass over the data. To deal with that, I use estimateCutoff to
estimate a distance which is "small" in terms of the data distribution.
This distance is then rapidly increased as the limit on the max number of
centroids is hit repeatedly with the early data.

> On the other hand, the estimateCutoff() could be biased to the minimum
> distance of the first 100 points in the stream, or perhaps the idea is
> to update also this value online...?

This is a lot like what happens. I scan the first several data points and
take the distance between the nearest pair as a scale. From then on, the
value is scaled up to make it more and more rare to need to collapse the
centroids.
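In case a concrete outline helps, here is a minimal Java sketch of the idea
described above: estimate a small data-derived cutoff, create a new centroid
when a point is improbably far from the existing ones, and grow the cutoff
whenever the centroid budget is exceeded. This is not the Mahout
StreamingKMeans code; the class name, the assign() helper, and the 1.5x
growth factor are purely illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Toy sketch of the 1-pass streaming clustering idea discussed above.
 * Not the Mahout implementation: names and the growth factor are
 * illustrative only.
 */
public class StreamingSketch {
  private final int maxClusters;
  private double distanceCutoff;                 // the "f" in the discussion
  private final List<double[]> centroids = new ArrayList<>();
  private final List<Integer> weights = new ArrayList<>();
  private final Random rand = new Random();

  public StreamingSketch(int maxClusters, double initialCutoff) {
    this.maxClusters = maxClusters;
    this.distanceCutoff = initialCutoff;
  }

  // Rough analogue of estimateCutoff(): use the smallest pairwise distance
  // among an initial sample of points as a "small" scale for the data.
  public static double estimateCutoff(List<double[]> sample) {
    double min = Double.POSITIVE_INFINITY;
    for (int i = 0; i < sample.size(); i++) {
      for (int j = i + 1; j < sample.size(); j++) {
        min = Math.min(min, distance(sample.get(i), sample.get(j)));
      }
    }
    return min > 0 ? min : 1.0e-9;               // guard against duplicate points
  }

  public void add(double[] point) {
    assign(point);
    // When the centroid budget is exceeded, grow the cutoff (constrained
    // growth rather than an unbounded schedule) and re-cluster the
    // centroids with the same assignment code used for points.
    if (centroids.size() > maxClusters) {
      distanceCutoff *= 1.5;
      List<double[]> old = new ArrayList<>(centroids);
      centroids.clear();
      weights.clear();
      for (double[] c : old) {
        assign(c);                               // centroid weights ignored for brevity
      }
    }
  }

  private void assign(double[] point) {
    if (centroids.isEmpty()) {
      centroids.add(point.clone());
      weights.add(1);
      return;
    }
    // Find the nearest existing centroid.
    int best = 0;
    double bestDist = distance(point, centroids.get(0));
    for (int i = 1; i < centroids.size(); i++) {
      double d = distance(point, centroids.get(i));
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    // Start a new cluster with probability proportional to how far the
    // point is relative to the cutoff; otherwise fold it into the nearest
    // centroid as a running mean.
    if (rand.nextDouble() < Math.min(1.0, bestDist / distanceCutoff)) {
      centroids.add(point.clone());
      weights.add(1);
    } else {
      double[] c = centroids.get(best);
      int w = weights.get(best);
      for (int d = 0; d < c.length; d++) {
        c[d] = (c[d] * w + point[d]) / (w + 1);
      }
      weights.set(best, w + 1);
    }
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

The point of the sketch is only that the cutoff starts from a data-derived
"small" distance and grows in controlled steps each time the maxClusters
budget is hit, rather than following the 1/(k(1+log n)) schedule of the
original paper, and that the same assignment code is reused when the
centroids themselves are collapsed.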
