Factor of 2 is good.  The important aspect of this optimization is that it
works with any real metric.  The streaming k-means stuff is only good (at
this point) for L_2 metric.

In the other direction, I don't think that the triangle inequality will
help streaming k-means because search for centroids in the sketch is done
by a completely different method (and doesn't have to be as precise as with
the current k-means).


On Mon, Nov 12, 2012 at 12:56 PM, Gustavo Enrique Salazar Torres <
[email protected]> wrote:

> Dear Mahout Developers:
> Some time ago I was working on the Elkan optimization for KMeans. It even
> had a patch (https://issues.apache.org/jira/browse/MAHOUT-645).
> Tests were made to measure speed on both original KMeans implementation and
> Elkan's. Number of iterations was fixed to 4, convergenceDelta to 0 and
> number of clusters was increased by 40 in every test.
> Hardware used was an Amazon EMR cluster with 3 m2.xlarge nodes (1 master
> with 2 core nodes).
> The dataset has 350K vectors extracted from a Wikipedia dump after being
> indexed by Solr. Vectors where extracted using Mahout's lucene.vector. File
> size is 806MB. Vector dimensionality is approximately 28000.
>
> Results are in minutes:
>
> Number of clusters | KMeans  | Elkan
> ------------------------------------------------------
> 20                         | 20.11      | 17.82
> 60                         | 28.76      | 19.83
> 100                       | 44.00      | 24,48
> 140                       | 47.52      | 25.44
> 180                       | 57.23      | 28.01
>
> I don't know if this implementation is faster than Streaming KMeans,
> probably not, I just wanted to make it available because results seem
> interesting and perhaps some property is useful.
> It is available on Github: http://github.com/tavoaqp/mahout-elkan
>
> Any suggestions or comments are welcome.
>
> Regards.
>
> Gustavo
>

Reply via email to