This does seem interesting, at least on a quick skim. I have a
few questions, however.

First, how does this compare in practice with k-means++ (which we still
don't have)?

Secondly, what about parallelism?

Thirdly, would it be better to simply retrofit something like an all-reduce
operation into our current k-means to avoid map-reduce iterations?
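
For concreteness, the all-reduce idea in the third question could be sketched
roughly as follows. This is not Mahout code: the class AllReduceKMeansSketch
and every method name in it are hypothetical, and the all-reduce is only
simulated inside a single JVM by summing per-worker arrays. The assumption is
that each worker keeps its partition in memory, computes per-cluster sums and
counts locally, and that one all-reduce per iteration replaces a full
map-reduce pass.

```java
import java.util.Arrays;

// Hypothetical sketch: Lloyd-style k-means where each iteration is
// "local statistics + one all-reduce" instead of one map-reduce job.
public class AllReduceKMeansSketch {

    /** Index of the centroid closest to p (squared Euclidean distance). */
    public static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Per-worker statistics: for each cluster, d coordinate sums then a count. */
    public static double[] localStats(double[][] partition, double[][] centroids) {
        int k = centroids.length;
        int d = centroids[0].length;
        double[] stats = new double[k * (d + 1)];
        for (double[] p : partition) {
            int c = nearest(p, centroids);
            for (int j = 0; j < d; j++) {
                stats[c * (d + 1) + j] += p[j];
            }
            stats[c * (d + 1) + d] += 1.0;
        }
        return stats;
    }

    /** Stand-in for a real all-reduce: element-wise sum over all workers. */
    public static double[] allReduceSum(double[][] perWorker) {
        double[] total = new double[perWorker[0].length];
        for (double[] w : perWorker) {
            for (int i = 0; i < total.length; i++) {
                total[i] += w[i];
            }
        }
        return total;
    }

    /** One iteration: local stats on every worker, all-reduce, centroid update. */
    public static double[][] iterate(double[][][] partitions, double[][] centroids) {
        int k = centroids.length;
        int d = centroids[0].length;
        double[][] perWorker = new double[partitions.length][];
        for (int w = 0; w < partitions.length; w++) {
            perWorker[w] = localStats(partitions[w], centroids);
        }
        double[] total = allReduceSum(perWorker);
        double[][] next = new double[k][d];
        for (int c = 0; c < k; c++) {
            double count = total[c * (d + 1) + d];
            for (int j = 0; j < d; j++) {
                // An empty cluster keeps its previous centroid.
                next[c][j] = count > 0 ? total[c * (d + 1) + j] / count : centroids[c][j];
            }
        }
        return next;
    }

    /** Runs a fixed number of iterations without any per-iteration job launch. */
    public static double[][] run(double[][][] partitions, double[][] initial, int iterations) {
        double[][] centroids = initial;
        for (int i = 0; i < iterations; i++) {
            centroids = iterate(partitions, centroids);
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][][] partitions = {
            {{0, 0}, {0, 2}},      // data held by worker 0
            {{10, 10}, {10, 12}}   // data held by worker 1
        };
        double[][] initial = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(run(partitions, initial, 3)));
    }
}
```

The only data exchanged per iteration is one array of k * (d + 1) doubles per
worker, which is what would make this cheaper than re-reading the input in a
fresh map-reduce job each time.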

On Sun, Jan 15, 2012 at 9:23 PM, Federico Castanedo <[email protected]> wrote:

> Hi all,
>
> I've recently been looking at this paper,
> "Fast and Accurate k-means for Large Datasets", recently presented at
> NIPS 2011:
> http://web.engr.oregonstate.edu/~shindler/papers/StreamingKMeans_soda11.pdf
>
> It seems to be an outstanding state-of-the-art approach to implementing
> streaming k-means for very large datasets,
> and my feeling is that it could be really cool to have in Mahout.
>
> I've just made a quick Java implementation (without M/R capabilities) in
> the Mahout trunk code (based on Michael Shindler's
> C++ implementation), but it still needs more work (testing that it works
> correctly, improving some parts, and cleaning up the code).
> Let me know if you think this method would be good to have in
> Mahout. I would like to open a Jira ticket and
> integrate it with your help if there is enough interest.
>
> Best,
> Federico
>
