Dear Mahout Developers:
Some time ago I was working on the Elkan optimization for KMeans. It even
had a patch (https://issues.apache.org/jira/browse/MAHOUT-645).
Tests were made to measure speed on both original KMeans implementation and
Elkan's. Number of iterations was fixed to 4, convergenceDelta to 0 and
number of clusters was increased by 40 in every test.
Hardware used was an Amazon EMR cluster with 3 m2.xlarge nodes (1 master
with 2 core nodes).
The dataset has 350K vectors extracted from a Wikipedia dump after being
indexed by Solr. Vectors where extracted using Mahout's lucene.vector. File
size is 806MB. Vector dimensionality is approximately 28000.

Results are in minutes:

Number of clusters | KMeans  | Elkan
------------------------------------------------------
20                         | 20.11      | 17.82
60                         | 28.76      | 19.83
100                       | 44.00      | 24,48
140                       | 47.52      | 25.44
180                       | 57.23      | 28.01

I don't know if this implementation is faster than Streaming KMeans,
probably not, I just wanted to make it available because results seem
interesting and perhaps some property is useful.
It is available on Github: http://github.com/tavoaqp/mahout-elkan

Any suggestions or comments are welcome.

Regards.

Gustavo

Reply via email to