- Have you considered sketch based algorithms?

- It can be important to use optimizations in the search for nearest
centroid.  Consider triangle optimizations.

- Do you mean "parallel" when you type || or is there another meaning there?

- When you say ++ initialization, many people get this wrong and assume
that you mean pick the furthest point.  Getting really good initialization
is fairly difficult and typically requires more time than the actual
clustering.  This is one of the key benefits of sketch based methods.

- Most algorithms require multiple restarts.  At higher dimension the
number of restarts required becomes very large.  An ideal implementation
does parallel sketch extraction followed by parallel ball k-means for
restarts.



On Wed, Apr 2, 2014 at 9:03 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Considering porting implementation [1] and paper for KMeans || for
> Bindings.
>
> This seems like another method to map fairly nicely.
>
> The problem I am contemplating is ||-initialization, and in particular,
> centroid storage. That particular implementation assumes centroids could be
> kept in memory in front.
>
> (1) Question is, is it a dangerous idea. It doesn't seem like it
> particularly is, since unlikely people would want more k>1e+6. Another
> thing, centers seem to be passed in via closure attribute (i.e.
> java-serialized array-backed matrix).However, with Bindings it is quite
> possible to keep centers at the back as a matrix.
>
> (2) obviously, LLoyd iterations are not terribly accurate. || and ++
> versions mostly speed things up. Is there any better-than-LLoyd accuracy
> preference?
>
>
> [1]
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
>

Reply via email to