I like shuffling out of the same belt-and-suspenders distrust of well-worn tracks. The number of centroids is tiny compared to the original data, so shuffling or copying an extra time isn't a big deal.
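To make that concrete, the copy-and-shuffle step being discussed could look roughly like the sketch below. The Centroid class and method names are illustrative stand-ins, not Mahout's actual API, and the re-clustering itself is left out:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class ShuffleBeforeCollapse {

        // Illustrative stand-in for a weighted centroid.
        static class Centroid {
            final double[] center;
            final double weight;
            Centroid(double[] center, double weight) {
                this.center = center;
                this.weight = weight;
            }
        }

        // The copy is needed before collapsing anyway, so the shuffle only
        // adds O(m) work on m centroids, tiny next to the original data.
        static List<Centroid> prepareForCollapse(Iterable<Centroid> sketch) {
            List<Centroid> copy = new ArrayList<>();
            for (Centroid c : sketch) {
                copy.add(c);
            }
            Collections.shuffle(copy); // remove any order sensitivity
            return copy;               // re-cluster this list to collapse
        }
    }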
On Thu, May 9, 2013 at 12:21 PM, Dan Filimon <dangeorge.fili...@gmail.com> wrote:

> I haven't noticed, but it makes me feel somewhat (irrationally :) better
> knowing that the points don't come through in the same order they
> previously came in.
> I thought of maybe having a flag, but I'm kind of split on the issue.
>
> Even if they aren't shuffled, we need to copy them to another list before
> collapsing anyway, so we'd still be looping through them once.
>
>
> On Thu, May 9, 2013 at 10:09 PM, Andy Twigg <andy.tw...@gmail.com> wrote:
>
> > Hi Dan,
> >
> > Sure. I took a quick look just now and it looks good. Did you notice
> > that shuffling before collapsing was helping, hence keeping it in? It
> > didn't make much difference for me.
> >
> > Andy
> >
> >
> > On 9 May 2013 16:05, Dan Filimon <dangeorge.fili...@gmail.com> wrote:
> >
> >> Andy, would you like to review the final version of the clustering
> >> code before it goes in [1]?
> >> [1] https://reviews.apache.org/r/10194/
> >>
> >> Ted, it's pretty much done. Okay it and I'll commit.
> >>
> >>
> >> On Wed, May 8, 2013 at 11:57 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >>> On Wed, May 8, 2013 at 10:28 AM, Dan Filimon
> >>> <dangeorge.fili...@gmail.com> wrote:
> >>>
> >>> > > > I think it avoids the need for the special way we handle the
> >>> > > > increase of distanceCutoff by beta in another if.
> >>> > >
> >>> > > Sure. Sounds right and all.
> >>> > >
> >>> > > But experiment will tell better.
> >>>
> >>> Yes.
> >>>
> >>> But I definitely saw cases where the same cutoff caused the centroid
> >>> count to decrease. In my mind, continuing to increase the cutoff in
> >>> those cases is a bad thing. A smaller cutoff is more conservative in
> >>> that it will preserve more data in the sketch. Until we see it
> >>> preserving too much data, we don't need to increase the cutoff.
> >>
> >> I kept the overshoot just to be safe in the CL.
> >>
> >>> > > > ... They actually call it a "facility cost" rather than a
> >>> > > > distance, probably for this reason.
> >>>
> >>> Btw... the reason they call it a facility cost is that they are
> >>> referring to a different literature. With k-means, k is traditionally
> >>> fixed. With facility assignment, it is traditionally not. The problems
> >>> are otherwise quite similar. The nomenclature differs because the
> >>> facility assignment work comes from operations research, not computer
> >>> science.
> >>
> >> Ah, well that explains it. :)
> >>
> >>> > ... I'm uncomfortable with the distanceCutoff growing too high, but
> >>> > I'll just put the blame for that one on the data.
> >>>
> >>> I am uncomfortable as well.
> >>>
> >>> This is one reason I would like to only increase the distanceCutoff
> >>> when a small value proves ineffective.
> >>
> >> Alright, this is the version that's going in.
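The policy Ted describes, collapse at the current cutoff first and grow the cutoff by beta only when that collapse fails to shrink the sketch, could be sketched like this. The collapse() stub and all the constants here are simulated for illustration, not the code under review:

    import java.util.Random;

    public class AdaptiveCutoff {

        private static final Random RAND = new Random(42);

        // Stand-in for re-clustering the sketch at a given cutoff; a larger
        // cutoff merges centroids more aggressively.
        static int collapse(int centroids, double cutoff) {
            double mergeRate = Math.min(0.9, cutoff / 10.0);
            return Math.max(1, centroids - (int) (centroids * mergeRate * RAND.nextDouble()));
        }

        public static void main(String[] args) {
            double distanceCutoff = 0.5; // start small: conservative, keeps more data
            final double beta = 1.3;     // growth factor for the cutoff
            final int maxAllowed = 100;  // e.g. clusterOvershoot * numClusters
            int numCentroids = 500;

            while (numCentroids > maxAllowed) {
                int before = numCentroids;
                numCentroids = collapse(numCentroids, distanceCutoff);
                // Grow the cutoff only when the smaller value proved
                // ineffective; until it preserves too much data there is
                // no need to increase it.
                if (numCentroids >= before) {
                    distanceCutoff *= beta;
                }
            }
            System.out.println("cutoff=" + distanceCutoff + ", centroids=" + numCentroids);
        }
    }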
> >>> > StreamingKMeans + BallKMeans gave good results compared to Mahout
> >>> > KMeans on other data sets (similar kinds of clusters and
> >>> > good-looking Dunn and Davies-Bouldin indices).
> >>>
> >>> You hide this gem in a long email!!!
> >>>
> >>> Good news.
> >>
> >> Yeah. :)
> >> It's comparable to Mahout KMeans quality-wise, and very tweakable.
> >> The speed improvements should be apparent on large data sets that we
> >> run on Hadoop.
> >>
> >>> > > > The estimate we give it at the beginning is only valid until
> >>> > > > enough data points have been processed to go over k log n.
> >>> > >
> >>> > > Are we talking about clusterOvershoot here? Or the numClusters
> >>> > > over-ride?
> >>> >
> >>> > We collapse the clusters when the number of actual centroids is over
> >>> > clusterOvershoot * numClusters.
> >>> > I'm thinking that since numClusters increases anyway,
> >>> > clusterOvershoot means we end up with more clusters than we need
> >>> > (not bad per se, but trying to get rid of variables).
> >>>
> >>> I view numClusters as the minimum number of clusters that we want to
> >>> see. clusterOvershoot says that we can go a ways above the minimum,
> >>> but we hopefully will just collapse back down to the minimum or above.
> >>>
> >>> > > Well, we have seen cases where the over-shoot needed to be >1.
> >>> > > Those may have gone away with better adaptation, but I think that
> >>> > > they probably still can happen.
> >>> >
> >>> > Sorry, what do you mean by adaptation here?
> >>>
> >>> Better adjustment and use of the distanceCutoff. This should make the
> >>> collapse in the recursive clustering less dramatic and more
> >>> predictable. That will make the system require less over-shoot.
> >
> > --
> > Dr Andy Twigg
> > Junior Research Fellow, St Johns College, Oxford
> > Room 351, Department of Computer Science
> > http://www.cs.ox.ac.uk/people/andy.twigg/
> > andy.tw...@cs.ox.ac.uk | +447799647538
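For reference, the collapse trigger and the growing numClusters estimate discussed above can be sketched in a few lines. The point loop and the instant collapse are simulated, and the constants are made up for illustration:

    public class CollapseTrigger {

        public static void main(String[] args) {
            final int k = 20;                    // clusters ultimately wanted
            final double clusterOvershoot = 1.5; // slack above the minimum
            long n = 0;                          // points seen so far
            int sketchSize = 0;                  // centroids currently held

            for (int point = 0; point < 1_000_000; point++) {
                n++;
                sketchSize++; // pessimistic: pretend every point opens a new centroid

                // numClusters is the minimum sketch size we want: roughly
                // k log n, so it keeps growing as more data is seen.
                int numClusters = (int) Math.ceil(k * Math.log(n + 1));

                // Collapse only once the sketch is clusterOvershoot times too big.
                if (sketchSize > clusterOvershoot * numClusters) {
                    sketchSize = numClusters; // stand-in for re-clustering the sketch
                }
            }
            System.out.println("final sketch size: " + sketchSize);
        }
    }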