Hi all,

These days I've been looking at this paper: "Fast and Accurate k-means for Large Datasets", recently presented at NIPS 2011.
http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf

It seems to be an outstanding, state-of-the-art approach to streaming k-means for very large datasets, and I feel it could be something really cool to have in Mahout. I've just made a quick Java implementation (without M/R capabilities) on top of the Mahout trunk code, based on Michael Shindler's C++ implementation, but it still needs more work: testing that it works correctly, improving some parts, and cleaning up the code.

Let me know if you think this method would be good to have in Mahout. If there is enough interest, I would like to open a Jira ticket and integrate it with your help.

Best,
Federico
