On Thu, Mar 28, 2013 at 11:30 PM, Andy Twigg <[email protected]> wrote:

> Dan/Ted:
>
> I like that you are implementing streaming k-means.
>

Thanks!


>
> Are there any results comparing it to mini batch k-means ([1] and the
> paper cited therein) ?
>

I haven't read the mini batch k-means paper. It'll be nice to have that
comparison as soon as the quirks are worked out, and there seem to be some
left right now. I think we should definitely do the comparison once it's
done.
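For anyone following along, my rough understanding of the mini-batch update
(from the paper the sklearn page cites) is: assign each point in a small
random batch to its nearest center, then nudge that center toward the point
with a per-center learning rate of 1/count. A toy sketch of one step, just
to make the comparison concrete (function name and structure are mine, not
from either implementation):

```python
def minibatch_kmeans_step(centers, counts, batch):
    """One mini-batch k-means update: assign each batch point to its
    nearest center, then move that center toward the point with a
    decaying per-center learning rate of 1/count."""
    # cache nearest-center assignments for the whole batch first
    assign = []
    for x in batch:
        j = min(range(len(centers)),
                key=lambda c: sum((xi - ci) ** 2
                                  for xi, ci in zip(x, centers[c])))
        assign.append(j)
    # gradient step with per-center decaying rate
    for x, j in zip(batch, assign):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [ci + eta * (xi - ci)
                      for xi, ci in zip(x, centers[j])]
    return centers, counts
```

The per-center 1/count rate is what makes each center converge to the
running mean of the points assigned to it.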


> In the distributed implementation, you independently compute a
> O(k)-means clustering on each partition, then combine them into a
> final k-means. Are there any guarantees/results about the accuracy of
> this? Clearly this sort of design also favours a storm/spark
> implementation - have you considered that?
>

To be precise, it's an O(k log n) clustering on each partition, and that is
supposed to offer *some* guarantees. I don't fully understand the math
behind it, so I can't comment further. Ted?
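To illustrate the combine step: each partition emits O(k log n) weighted
centroids, and the final pass clusters the union of those (point, weight)
pairs down to k centers. Here's a toy weighted-Lloyd sketch of that final
pass (naive seeding and empty-cluster handling are simplifications of
mine, not what the actual BallKMeans does):

```python
def weighted_kmeans(sketch, k, iters=10):
    """Cluster a list of (point, weight) pairs -- the union of the
    per-partition sketches -- down to k centers with weighted Lloyd
    iterations. Seeding by taking the first k points is naive and
    for illustration only."""
    centers = [list(p) for p, _ in sketch[:k]]
    for _ in range(iters):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        wts = [0.0] * k
        for p, w in sketch:
            # weighted assignment to the nearest current center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            for d, a in enumerate(p):
                sums[j][d] += w * a
            wts[j] += w
        for j in range(k):
            if wts[j] > 0:  # leave empty clusters where they are
                centers[j] = [s / wts[j] for s in sums[j]]
    return centers
```

The intuition (as I understand it) is that the weights preserve how much
mass each sketch centroid represents, so the final clustering approximates
clustering the full data set.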

For now, I'd like to get the Hadoop version committed. I've heard really
positive things about Spark but haven't tried it.

The key constraint is that getting a high-quality implementation and
evaluating it doubles as my senior project. Depending on how much longer
this implementation takes, I'll say maybe to a Spark version. I think the
comparison would definitely be interesting. :)


> -Andy
>
>
>
> [1]
> http://scikit-learn.org/dev/modules/generated/sklearn.cluster.MiniBatchKMeans.html
>
