These speeds are close to what the new streaming k-means achieves, except
that it reaches them (1 million points in 20 seconds at 10 dimensions) on
a single node instead of 16.  This is with a trivially parallel algorithm
that needs no iteration.  Running it under Hadoop would incur the normal
startup costs (10-20 seconds with MapR), but it should otherwise run at
the same speed, adjusted for node count.
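
To illustrate what "no need for iteration" means, here is a rough
single-pass clustering sketch.  It is not the code from the knn repository.
The function name, the parameters (max_clusters, threshold, growth) and the
merge rule are assumptions chosen for illustration.  Each point is seen
once and is either folded into its nearest centroid or, with probability
proportional to its distance, starts a new one.  When there are too many
centroids, the threshold is loosened and the centroids themselves are
re-clustered.

```python
import math
import random


def stream_cluster(points, max_clusters, threshold=1.0, growth=1.5, seed=0):
    """One pass over `points`; returns a list of (centroid, weight) pairs.

    Hypothetical sketch: names and defaults are illustrative assumptions,
    not the API of any existing library.
    """
    rng = random.Random(seed)
    centroids = []  # each entry is [coordinates, weight]

    def absorb(point, weight):
        if not centroids:
            centroids.append([list(point), weight])
            return
        # Find the nearest existing centroid and its distance.
        d, best = min(
            ((math.dist(c[0], point), c) for c in centroids),
            key=lambda t: t[0],
        )
        # Distant points are more likely to start a new cluster.
        if rng.random() < d / threshold:
            centroids.append([list(point), weight])
        else:
            # Otherwise move the centroid to the weighted mean.
            total = best[1] + weight
            best[0] = [
                (b * best[1] + p * weight) / total
                for b, p in zip(best[0], point)
            ]
            best[1] = total

    for p in points:
        absorb(p, 1.0)
        # Too many centroids: loosen the threshold and re-cluster the
        # centroids themselves (never the original points).
        while len(centroids) > max_clusters:
            threshold *= growth
            old = centroids[:]
            centroids.clear()
            for coords, weight in old:
                absorb(coords, weight)

    return [(tuple(c), w) for c, w in centroids]
```

Because each call only needs its own slice of the data, independent
slices can be processed on separate nodes and the resulting weighted
centroids merged afterwards.  That is the sense in which an approach like
this is trivially parallel.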

See https://github.com/tdunning/knn/tree/master/docs for more info on this
clustering algorithm.

On Sat, May 26, 2012 at 9:26 AM, Thomas Jungblut <
[email protected]> wrote:

> We have benchmarks showing the scalability and maturity of Hama [3] and
> would be glad to roll out to several other Apache projects.
> BTW it would be cool if we could compare the performance of your k-means in
> MapReduce with that of our BSP version; you can see the benchmark in [3] as
> well.
>
> Actually that was not why we are here; we wanted to hear some general
> interest in real-time recommendation with Hama since all the ML guys are
> here. Even if Ted is a fanboy of Giraph ;)
>
> Regards from Berlin,
> Thomas
>
> [1] http://pulse.apache.org/#incubator.apache.org
> [2] http://code.google.com/p/psvm/
> [3] http://wiki.apache.org/hama/Benchmarks
>
