Dear Mahout Developers:

Some time ago I worked on the Elkan optimization for KMeans, and there is a patch for it (https://issues.apache.org/jira/browse/MAHOUT-645).

I ran tests to compare the running time of the original KMeans implementation with Elkan's. The number of iterations was fixed at 4, convergenceDelta was set to 0, and the number of clusters was increased by 40 in each test. The hardware was an Amazon EMR cluster with 3 m2.xlarge nodes (1 master and 2 core nodes).

The dataset has 350K vectors extracted from a Wikipedia dump after it was indexed by Solr. The vectors were extracted using Mahout's lucene.vector. The file size is 806MB and the vector dimensionality is approximately 28000.
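For anyone not familiar with the optimization, here is a minimal sketch of the core idea, assuming Euclidean distance. It is not the code from the patch: the class and method names (ElkanPruningSketch, centerDistances, nearestCenter) are made up for illustration, and it shows only the simplest triangle-inequality test, without the per-point upper and lower bounds that Elkan's full algorithm carries between iterations.

```java
final class ElkanPruningSketch {

    // Counts point-to-center distance evaluations, only to make the saving visible.
    static long pointDistanceCalls = 0;

    // Plain Euclidean distance (an assumption; the pruning test needs a true metric).
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Center-to-center distances: k*k values, computed once per iteration.
    static double[][] centerDistances(double[][] centers) {
        int k = centers.length;
        double[][] cc = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                cc[i][j] = euclidean(centers[i], centers[j]);
                cc[j][i] = cc[i][j];
            }
        }
        return cc;
    }

    // Returns the index of the nearest center, skipping centers that cannot win.
    static int nearestCenter(double[] x, double[][] centers, double[][] cc) {
        int best = 0;
        double bestDist = euclidean(x, centers[0]);
        pointDistanceCalls++;
        for (int j = 1; j < centers.length; j++) {
            // Triangle inequality: d(c_best, c_j) <= d(x, c_best) + d(x, c_j).
            // So if d(c_best, c_j) >= 2 * d(x, c_best), then d(x, c_j) >= d(x, c_best)
            // and c_j cannot be strictly closer: skip the expensive distance.
            if (cc[best][j] >= 2.0 * bestDist) {
                continue;
            }
            double d = euclidean(x, centers[j]);
            pointDistanceCalls++;
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 0}, {0, 10}};
        double[][] cc = centerDistances(centers);
        int assigned = nearestCenter(new double[] {1, 0}, centers, cc);
        System.out.println("assigned to center " + assigned + " using "
                + pointDistanceCalls + " of " + centers.length
                + " possible distance computations");
    }
}
```

Each skipped center saves one distance computation over a vector of roughly 28000 dimensions, at the cost of the k*k center-to-center distances per iteration.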
Results are in minutes:

Number of clusters | KMeans | Elkan
------------------------------------
                20 |  20.11 | 17.82
                60 |  28.76 | 19.83
               100 |  44.00 | 24.48
               140 |  47.52 | 25.44
               180 |  57.23 | 28.01

I don't know whether this implementation is faster than Streaming KMeans (probably not). I just wanted to make it available because the results seem interesting and perhaps some part of it is useful. It is available on Github: http://github.com/tavoaqp/mahout-elkan

Any suggestions or comments are welcome.

Regards,
Gustavo
