On Thu, Mar 28, 2013 at 11:30 PM, Andy Twigg <[email protected]> wrote:
> Dan/Ted:
>
> I like that you are implementing streaming k-means.

Thanks!

> Are there any results comparing it to mini batch k-means ([1] and the
> paper cited therein)?

I haven't read the mini batch k-means paper. It'll be nice to have as soon
as the quirks are worked out, and there seem to be some left right now. I
think we should definitely do the comparison when it's done.

> In the distributed implementation, you independently compute a
> O(k)-means clustering on each partition, then combine them into a
> final k-means. Are there any guarantees/results about the accuracy of
> this? Clearly this sort of design also favours a storm/spark
> implementation - have you considered that?

There's an O(k log n) clustering on each partition, and that is supposed
to offer *some* guarantees. I don't fully understand the math behind it,
so I can't comment further. Ted?

For now, I'd like to get the Hadoop version committed. I've heard really
positive things about Spark but haven't tried it. The key constraint is
that getting a high-quality implementation and evaluating it doubles as my
senior project. Depending on how much longer this implementation takes,
I'll say maybe to a Spark version. I think the comparison would definitely
be interesting. :)

> -Andy
>
> [1]
> http://scikit-learn.org/dev/modules/generated/sklearn.cluster.MiniBatchKMeans.html
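For anyone following along, here is a rough sketch of the two-phase design
being discussed: each partition is reduced in one streaming pass to a small
weighted set of centroids (on the order of k log n), and the union of those
weighted centroids is then re-clustered down to k with weighted Lloyd's
iterations. This is only an illustration of the shape of the algorithm, not
Mahout's actual implementation; the function names and the threshold
schedule are made up for the example.

```python
import numpy as np

def streaming_sketch(points, max_centroids):
    """One pass over a partition: keep at most max_centroids weighted
    centroids. Each point merges into its nearest centroid when close
    enough; otherwise it opens a new centroid. (Illustrative threshold
    schedule, not the one from the streaming k-means paper.)"""
    centroids, weights = [points[0].astype(float)], [1.0]
    threshold = 1.0  # distance scale; relaxed as the sketch fills up
    for p in points[1:]:
        d = [np.linalg.norm(p - c) for c in centroids]
        i = int(np.argmin(d))
        if d[i] < threshold or len(centroids) >= max_centroids:
            # merge into the nearest centroid (weighted running mean)
            w = weights[i]
            centroids[i] = (centroids[i] * w + p) / (w + 1.0)
            weights[i] = w + 1.0
            if len(centroids) >= max_centroids:
                threshold *= 1.5  # make future merges easier
        else:
            centroids.append(p.astype(float))
            weights.append(1.0)
    return np.array(centroids), np.array(weights)

def weighted_kmeans(centroids, weights, k, iters=20, seed=0):
    """Weighted Lloyd's algorithm: the final 'reduce' step that turns
    the combined sketches into k clusters."""
    rng = np.random.default_rng(seed)
    means = centroids[rng.choice(len(centroids), k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmin(np.linalg.norm(means - c, axis=1))
                           for c in centroids])
        for j in range(k):
            mask = assign == j
            if mask.any():
                means[j] = np.average(centroids[mask], axis=0,
                                      weights=weights[mask])
    return means

# Simulate four "mapper" partitions, sketch each, then run one reduce:
rng = np.random.default_rng(42)
partitions = [rng.normal(loc, 0.1, size=(200, 2))
              for loc in ([0, 0], [5, 5], [0, 5], [5, 0])]
sketches = [streaming_sketch(p, max_centroids=30) for p in partitions]
all_c = np.vstack([c for c, _ in sketches])
all_w = np.concatenate([w for _, w in sketches])
final = weighted_kmeans(all_c, all_w, k=4)
```

The point of the design is that only the small weighted sketches, not the
raw points, cross partition boundaries, which is what makes the single
combine step cheap.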
