Hi all,

Over the past few days I've been looking at this paper:
"Fast and Accurate k-means for Large Datasets", recently presented at
NIPS 2011.
http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf

It seems to be an outstanding, state-of-the-art approach to streaming
k-means on very large datasets, and I think it would be a really good
addition to Mahout.

I've just written a quick Java implementation (without M/R capabilities)
in the Mahout trunk code, based on Michael Shindler's C++
implementation. It still needs more work: testing that it works
correctly, improving some parts, and cleaning up the code.
Let me know if you think this method would be good to have in Mahout.
If there is enough interest, I would like to open a Jira ticket and
integrate it with your help.

Best,
Federico
