On Fri, Apr 4, 2014 at 8:14 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> On Thu, Apr 3, 2014 at 10:35 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > - Have you considered sketch based algorithms?
> >
>
> Can you give me a reference?  At this point I am just contemplating a more
> or less shameless port of what they've done in MLlib.
>

Here is the reference I used:

http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets

*A quick summary:*

- a single pass over the data with a sort of dp-means [1] algorithm using a
very fast approximate centroid search can give you k log N centroids.  This
is the sketch.

- clustering these centroids as weighted values gives a provably probably
good clustering of the original data for well clusterable data.  For real
data, it seems to work exactly as advertised.

- clustering the sketch using ball k-means with clever initialization is
important.  Good initialization is very expensive in terms of number of
points so using a sketch is a really good idea.

- the proofs all depend on Euclidean distance.

- the threshold can start very small (what small means is determined
empirically).  Each time we wind up with too many centroids, recursively
clustering the centroids and possibly increasing the threshold will keep
the number reasonably bounded.
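The one-pass sketch described above can be roughed out as follows. This is an illustrative Python sketch, not the paper's algorithm verbatim: the starting threshold, the 1.5x growth factor, and the size bound are assumptions chosen for the example, and the brute-force nearest-centroid search stands in for the fast approximate search the paper depends on.

```python
import math
import random

def collapse(centroids, beta):
    """Shrink a sketch by re-clustering the weighted centroids
    themselves with a (larger) threshold beta."""
    merged = []
    for c, w in centroids:
        if merged:
            d, i = min((math.dist(c, mc), i)
                       for i, (mc, _) in enumerate(merged))
            if d <= beta:
                mc, mw = merged[i]
                merged[i] = (tuple((a * mw + b * w) / (mw + w)
                                   for a, b in zip(mc, c)), mw + w)
                continue
        merged.append((c, w))
    return merged

def sketch(points, k, n, beta0=0.5, rng=random):
    """One-pass dp-means-style sketch yielding O(k log n) weighted
    centroids.  beta0 and the growth factor are empirical choices."""
    limit = max(1, int(10 * k * math.log2(n + 1)))  # illustrative bound
    beta = beta0
    centroids = []  # (coordinates, weight) pairs
    for p in points:
        if centroids:
            d, i = min((math.dist(p, c), i)
                       for i, (c, _) in enumerate(centroids))
            # online-facility-location rule: open a new centroid with
            # probability min(1, d/beta), otherwise merge into nearest
            if rng.random() >= min(1.0, d / beta):
                c, w = centroids[i]
                centroids[i] = (tuple((a * w + b) / (w + 1)
                                      for a, b in zip(c, p)), w + 1)
                continue
        centroids.append((tuple(p), 1))
        while len(centroids) > limit:  # too many: grow beta, collapse
            beta *= 1.5
            centroids = collapse(centroids, beta)
    return centroids
```

The total weight of the sketch always equals the number of points seen, which is what makes the later weighted clustering of the centroids legitimate.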

*Some notes from practical experience:*

- moderate levels of parallelism are easy since sketches can be merged
using set union.  You may want to recursively cluster the centroids at this
point if you have too many.  This is a very nice application of map-reduce.

- high degrees of parallelism require multiple levels of merge/collapse
since otherwise you wind up with a combined sketch nearly as large as the
original data.  If you have m parallel clusterings, then mk log(N/m) can be
large.  Take a billion points with m = 1000 and k = 10,000.  The combined
size of the parallel sketches is mk log2(N/m) ≈ 200 x 10^6 = 0.2 N.  But
with m = 100 and k = 1000, we have mk log2(N/m) ≈ 2.3 x 10^6, which is
quite manageable.
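A quick back-of-envelope check of those sizes (assuming the log in the k log N bound is base 2):

```python
import math

def combined_sketch_size(n, m, k):
    """Total centroids across m parallel sketches of N/m points each:
    m * k * log2(N/m)."""
    return m * k * math.log2(n / m)

n = 10**9
print(combined_sketch_size(n, 1000, 10_000))  # ~2.0e8, i.e. 0.2 N
print(combined_sketch_size(n, 100, 1_000))    # ~2.3e6, quite manageable
```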

- ball k-means on highly clusterable data often uses a cutoff for centroid
computation of 0.5 x the distance to the nearest other centroid.  I find
that with real data a cutoff of 1 x that distance, or even larger, is much
better.  The good effects are still mostly there, but you don't need
wonderful data to succeed.
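One update step with that cutoff can be sketched as below. This is an illustrative toy, not Mahout's implementation; the point is that each centroid is recomputed only from "core" points inside its ball, so stragglers and outliers don't drag it around.

```python
import math

def ball_kmeans_step(points, centers, cutoff_factor=1.0):
    """One Lloyd-style update where each centroid is recomputed only
    from assigned points within cutoff_factor * (distance from that
    centroid to its nearest other centroid).  cutoff_factor=0.5 is the
    usual theoretical setting; 1.0 reflects the practical advice above."""
    k, dim = len(centers), len(centers[0])
    # distance from each centroid to the nearest other centroid
    near = [min(math.dist(centers[i], centers[j])
                for j in range(k) if j != i) for i in range(k)]
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        d, i = min((math.dist(p, c), i) for i, c in enumerate(centers))
        if d <= cutoff_factor * near[i]:  # keep only "core" points
            counts[i] += 1
            for a in range(dim):
                sums[i][a] += p[a]
    # empty balls keep their old centroid
    return [tuple(s / counts[i] for s in sums[i]) if counts[i]
            else centers[i] for i in range(k)]
```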

- the ball k-means algorithm that we have in Mahout is pretty high quality,
but could use a bit of triangle-inequality (Elkan [2]) speedup.
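The triangle speedup works like this: if d(c, c') >= 2 d(x, c), the triangle inequality guarantees d(x, c') >= d(x, c), so the distance from x to c' never needs to be computed. A minimal sketch of the idea (hypothetical helper, not the Mahout code; the center-to-center distances would be precomputed once per iteration):

```python
import math

def nearest_with_skips(x, centers, assigned, cc_dist):
    """Find the center nearest to x, skipping candidates ruled out by
    the triangle bound.  cc_dist[i][j] holds d(centers[i], centers[j])."""
    d_ref = math.dist(x, centers[assigned])  # fixed reference distance
    best, d_best = assigned, d_ref
    skipped = 0
    for j, c in enumerate(centers):
        if j == assigned:
            continue
        if cc_dist[assigned][j] >= 2 * d_ref:
            # d(x, c) >= cc_dist - d_ref >= d_ref >= d_best: cannot win
            skipped += 1
            continue
        d = math.dist(x, c)
        if d < d_best:
            best, d_best = j, d
    return best, skipped
```

When the current assignment is already close, most candidate centers are eliminated without any distance computation, which is where the bulk of Lloyd-iteration time goes.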

*References*

[1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
[2] http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf


>
> >
> > - It can be important to use optimizations in the search for nearest
> > centroid.  Consider triangle optimizations.
> >
> > - Do you mean "parallel" when you type || or is there another meaning
> > there?
> >
>
> No, I mean the method called "kmeans||". It's an unfortunate name since I
> really don't know how to make Google search for it.
>
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
>
> >
> > - When you say ++ initialization, many people get this wrong and assume
> > that you mean pick the furthest point.  Getting really good
> initialization
> > is fairly difficult and typically requires more time than the actual
> > clustering.  This is one of the key benefits of sketch based methods.
> >
> > - Most algorithms require multiple restarts.  At higher dimension the
> > number of restarts required becomes very large.  An ideal implementation
> > does parallel sketch extraction followed by parallel ball k-means for
> > restarts.
> >
> >
> >
> > On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > Considering porting implementation [1] and paper for KMeans || for
> > > Bindings.
> > >
> > > This seems like another method to map fairly nicely.
> > >
> > > The problem I am contemplating is ||-initialization, and in particular,
> > > centroid storage. That particular implementation assumes centroids could
> > > be kept in memory in front.
> > >
> > > (1) The question is whether that is a dangerous idea. It doesn't seem
> > > particularly so, since it's unlikely people would want k > 1e+6. Another
> > > thing: centers seem to be passed in via a closure attribute (i.e., a
> > > java-serialized array-backed matrix). However, with Bindings it is quite
> > > possible to keep centers at the back as a matrix.
> > >
> > > (2) Obviously, Lloyd iterations are not terribly accurate. The || and ++
> > > versions mostly speed things up. Is there any preference for a
> > > better-than-Lloyd-accuracy method?
> > >
> > >
> > > [1]
> > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
> > >
> >
>
