These speeds are close to what the new streaming k-means achieves, except that it reaches them (1 million points in 20 seconds at 10 dimensions) on a single node rather than on 16. The algorithm is trivially parallel and needs no iteration. Running it under Hadoop would incur the normal job startup cost (10-20 seconds with MapR), but otherwise it should run at the same per-node speed, so throughput should scale with the number of nodes.
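To make "no need for iteration" concrete, here is a rough single-pass sketch in Python. It is only an illustration of the general idea, not the code in the knn repository: the names `streaming_sketch` and `_absorb`, the deterministic distance cutoff, and the `growth` factor are all assumptions made for this example. Each point is seen once and either folds into its nearest centroid or starts a new one; when the sketch grows too large, the cutoff is raised and the centroids are re-absorbed.

```python
import math


def _absorb(centroids, point, weight, cutoff):
    """Fold a weighted point into its nearest centroid, or start a new one."""
    if centroids:
        # Brute-force nearest-centroid search, kept simple for clarity.
        nearest = min(centroids, key=lambda c: math.dist(c[0], point))
        if math.dist(nearest[0], point) <= cutoff:
            total = nearest[1] + weight
            # Weighted mean of the old centroid and the incoming point.
            nearest[0] = [(a * nearest[1] + b * weight) / total
                          for a, b in zip(nearest[0], point)]
            nearest[1] = total
            return
    centroids.append([list(point), weight])


def streaming_sketch(points, max_centroids, cutoff=1.0, growth=2.0):
    """One pass over `points`; returns a list of [coords, weight] centroids.

    Assumes cutoff > 0, growth > 1 and max_centroids >= 1 (hypothetical
    parameters chosen for this illustration).
    """
    centroids = []
    for p in points:
        _absorb(centroids, p, 1.0, cutoff)
        # Too many centroids: loosen the cutoff and collapse the sketch.
        while len(centroids) > max_centroids:
            cutoff *= growth
            old, centroids = centroids, []
            for coords, weight in old:
                _absorb(centroids, coords, weight, cutoff)
    return centroids
```

Under these assumptions, one plausible reading of "trivially parallel" is that each node builds its own sketch over its share of the data, and the resulting weighted centroids are then combined by running them through the same absorb step.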
See https://github.com/tdunning/knn/tree/master/docs for more info on this clustering algorithm.

On Sat, May 26, 2012 at 9:26 AM, Thomas Jungblut <[email protected]> wrote:

> We have benchmarks showing the scalability and maturity of Hama [3] and
> would be glad to roll out to several other Apache projects.
> BTW it would be cool if we could compare the performance of your k-means in
> MapReduce with that of our BSP version, you see the benchmark in [3] as
> well.
>
> Actually that was not why we are here, we wanted to hear some general
> interest in real-time recommendation with Hama since all the ML guys are
> here. Even if Ted is a fanboy of giraph ;)
>
> Regards from Berlin,
> Thomas
>
> [1] http://pulse.apache.org/#incubator.apache.org
> [2] http://code.google.com/p/psvm/
> [3] http://wiki.apache.org/hama/Benchmarks
