Factor of 2 is good. The important aspect of this optimization is that it works with any real metric. The streaming k-means stuff is only good (at this point) for L_2 metric.
In the other direction, I don't think that the triangle inequality will help streaming k-means because search for centroids in the sketch is done by a completely different method (and doesn't have to be as precise as with the current k-means). On Mon, Nov 12, 2012 at 12:56 PM, Gustavo Enrique Salazar Torres < [email protected]> wrote: > Dear Mahout Developers: > Some time ago I was working on the Elkan optimization for KMeans. It even > had a patch (https://issues.apache.org/jira/browse/MAHOUT-645). > Tests were made to measure speed on both original KMeans implementation and > Elkan's. Number of iterations was fixed to 4, convergenceDelta to 0 and > number of clusters was increased by 40 in every test. > Hardware used was an Amazon EMR cluster with 3 m2.xlarge nodes (1 master > with 2 core nodes). > The dataset has 350K vectors extracted from a Wikipedia dump after being > indexed by Solr. Vectors where extracted using Mahout's lucene.vector. File > size is 806MB. Vector dimensionality is approximately 28000. > > Results are in minutes: > > Number of clusters | KMeans | Elkan > ------------------------------------------------------ > 20 | 20.11 | 17.82 > 60 | 28.76 | 19.83 > 100 | 44.00 | 24,48 > 140 | 47.52 | 25.44 > 180 | 57.23 | 28.01 > > I don't know if this implementation is faster than Streaming KMeans, > probably not, I just wanted to make it available because results seem > interesting and perhaps some property is useful. > It is available on Github: http://github.com/tavoaqp/mahout-elkan > > Any suggestions or comments are welcome. > > Regards. > > Gustavo >
