Re: Streaming KMeans distance cutoff

Dan Filimon Thu, 09 May 2013 08:06:42 -0700

Andy, would you like to review the final version of the clustering code
before it goes in [1]?
[1] https://reviews.apache.org/r/10194/


Ted, it's pretty much done. Okay it and I'll commit.


On Wed, May 8, 2013 at 11:57 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Wed, May 8, 2013 at 10:28 AM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
>
> > > > I think it avoids the need of the special way we handle the increase
> > > > of distanceCutoff by beta in another if.
> > > >
> > >
> > > Sure.  Sounds right and all.
> > >
> > > But experiment will tell better.
> >
>
> yes.
>
> But I definitely saw cases where the same cutoff caused the centroid count
> to decrease.  In my mind, continuing to increase the cutoff in those cases
> is a bad thing.  A smaller cutoff is more conservative in that it will
> preserve more data in the sketch.  Until we see it preserving too much
> data, we don't need to increase the cutoff.
>

I kept the overshoot just to be safe in the CL.
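
For reference, a minimal sketch of the rule Ted describes above, written in Python rather than the Java in the CL. The function name and its parameters are hypothetical, and using "the centroid count went down" as the test for an effective cutoff is one reading of this thread, not the committed code.

```python
def next_cutoff(cutoff, beta, count_before, count_after):
    """Pick the distance cutoff to use after a collapse pass.

    Hypothetical helper: keep the current cutoff if collapsing with it
    already reduced the number of centroids, and grow it by the factor
    beta only when the collapse proved ineffective.
    """
    if beta <= 1.0:
        raise ValueError("beta must be greater than 1 for the cutoff to grow")
    if count_after < count_before:
        # The same cutoff shrank the sketch, so stay conservative.
        return cutoff
    # The collapse did not help; a larger cutoff is needed.
    return cutoff * beta
```

With this rule the cutoff grows only in response to evidence that the smaller value keeps too many centroids.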

> > > > ... They actually call it a "facility cost" rather than a distance,
> > > > probably for this reason.
> >
>
> Btw... the reason that they call it a facility cost is because they are
> referring to a different literature.  With k-means, k is traditionally
> fixed.  With facility assignment, it is traditionally not.  The problems
> are otherwise quite similar.  The reason for the difference in nomenclature
> is because the facility assignment stuff comes from operations research,
> not computer science.
>

Ah, well that explains it. :)
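
To make the analogy concrete: in online facility location, a point at distance d from its nearest facility opens a new facility with probability min(1, d / f), where f is the facility cost. A sketch, assuming distanceCutoff plays the role of f here; the function name is hypothetical and nothing below is taken from the CL.

```python
def new_centroid_probability(distance, distance_cutoff):
    """Probability that a point starts a new centroid.

    Assumption: distance_cutoff acts as the facility cost f, so the
    probability is min(1, d / f).
    """
    if distance_cutoff <= 0:
        raise ValueError("distance_cutoff must be positive")
    return min(1.0, distance / distance_cutoff)
```

A larger cutoff makes new centroids rarer, which is why an overly large distanceCutoff discards detail from the sketch.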

> > ... I'm uncomfortable with the distanceCutoff growing too high, but I'll
> > just put the blame for that one on the data.
> >
>
> I am uncomfortable as well.
>
> This is one reason I would like to only increase the distanceCutoff when a
> small value proves ineffective.


Alright, this is the version that's going in.


> > StreamingKMeans + BallKMeans gave good results compared to Mahout KMeans
> > on other data sets (similar kinds of clusters and good looking Dunn and
> > Davies-Bouldin indices).
> >
>
> You hide this gem in a long email!!!
>
> Good news.


Yeah. :)
It's comparable to Mahout KMeans quality-wise, and very tweakable.
The speed improvements should be apparent on large data sets that we run on
Hadoop.
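
For readers unfamiliar with the Dunn index mentioned above, here is a small sketch of one common variant: the smallest distance between points in different clusters divided by the largest cluster diameter. Higher values mean tighter, better separated clusters. This is illustrative Python, not the evaluation code used for the comparison, and the function name is hypothetical.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Dunn index for a list of clusters, each a list of point tuples."""
    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("every cluster has zero diameter")
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    return min_separation / max_diameter


example = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
print(dunn_index(example))  # 5.0: separation 5, diameter 1
```

This brute-force version is quadratic in the number of points, so it only suits small samples.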

> >
> >
> > > > The estimate we give it at the beginning is only valid as long as not
> > > > enough datapoints have been processed to go over k log n.
> > > >
> > >
> > > Are we talking about clusterOvershoot here?  Or the numClusters over-ride?
> >
> >
> > We collapse the clusters when the number of actual centroids is over
> > clusterOvershoot * numClusters.
> > I'm thinking that since numClusters increases anyway, clusterOvershoot
> > means we end up with more clusters than we need (not bad per se, but
> > trying to get rid of variables).
> >
>
> I view it as numClusters is the minimum number of clusters that we want to
> see.  clusterOvershoot says that we can go a ways above the minimum, but we
> hopefully will just collapse back down to the minimum or above.
>
>
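
The collapse trigger discussed above can be written down in a couple of lines. This is a Python sketch with hypothetical names that mirror numClusters and clusterOvershoot; it assumes the comparison is a strict "over", as worded in the thread.

```python
def needs_collapse(num_centroids, num_clusters, cluster_overshoot):
    """True when the sketch has grown past cluster_overshoot * num_clusters.

    num_clusters is treated as the minimum number of centroids wanted,
    and cluster_overshoot (>= 1) as the slack allowed above that minimum.
    """
    if cluster_overshoot < 1.0:
        raise ValueError("cluster_overshoot below 1 would undercut the minimum")
    return num_centroids > cluster_overshoot * num_clusters
```

For example, with numClusters = 20 and clusterOvershoot = 1.5, the sketch may hold up to 30 centroids, and the 31st triggers a collapse.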
>
> > > Well, we have seen cases where the over-shoot needed to be >1.  Those may
> > > have gone away with better adaptation, but I think that they probably
> > > still can happen.
> > >
> >
> > Sorry, what do you mean by adaptation here?
> >
>
> Better adjustment and use of the distanceCutoff.  This should make the
> collapse in the recursive clustering be less dramatic and more predictable.
>  That will make the system require less over-shoot.
>
