Dan/Ted: I like that you are implementing streaming k-means.
Are there any results comparing it to mini-batch k-means ([1] and the paper cited therein)? In the distributed implementation, you independently compute an O(k)-means clustering on each partition, then combine them into a final k-means. Are there any guarantees or results about the accuracy of this? Clearly this sort of design also favours a Storm/Spark implementation. Have you considered that?
-Andy

[1] http://scikit-learn.org/dev/modules/generated/sklearn.cluster.MiniBatchKMeans.html
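
P.S. To make the accuracy question concrete, here is a rough NumPy sketch of the two-stage scheme as I understand it from the description: cluster each partition into O(k) weighted centers, then cluster those centers down to k. It is not your implementation. The helper names (`kmeans`, `cost`, `two_stage_kmeans`), the oversampling factor of 3, the plain Lloyd iterations, and the synthetic blobs are all my own assumptions, chosen only to keep the example short.

```python
import numpy as np


def kmeans(points, k, weights=None, n_iter=20, seed=0):
    """Plain weighted Lloyd iterations. Returns (centers, mass per center)."""
    rng = np.random.default_rng(seed)
    if weights is None:
        weights = np.ones(len(points))
    # Initialise from k distinct input points (assumes len(points) >= k).
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # leave an empty cluster's center where it is
                centers[j] = np.average(points[mask], axis=0, weights=weights[mask])
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    mass = np.bincount(d2.argmin(axis=1), weights=weights, minlength=k)
    return centers, mass


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()


def two_stage_kmeans(points, k, n_partitions=4, oversample=3, seed=0):
    """Stage 1: (oversample * k)-means per partition. Stage 2: weighted k-means
    on the union of the stage-1 centers, weighted by how many points each holds."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(points), n_partitions)
    sketch, mass = [], []
    for i, part in enumerate(parts):
        c, m = kmeans(part, oversample * k, seed=seed + i)
        keep = m > 0  # drop centers that ended up with no points
        sketch.append(c[keep])
        mass.append(m[keep])
    final, _ = kmeans(np.vstack(sketch), k, weights=np.concatenate(mass), seed=seed)
    return final


# Synthetic data: three Gaussian blobs (purely illustrative).
rng = np.random.default_rng(42)
blob_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + rng.normal(size=(200, 2)) for m in blob_means])

k = 3
direct_centers, _ = kmeans(X, k)
final_centers = two_stage_kmeans(X, k)
direct_cost = cost(X, direct_centers)
two_stage_cost = cost(X, final_centers)
print("direct k-means cost:   ", direct_cost)
print("two-stage k-means cost:", two_stage_cost)
```

Comparing the two printed costs over several seeds and partition counts is the kind of number I would like to see, ideally next to mini-batch k-means on the same data.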
