I have some new clustering code that I have been working on.  It will
probably be targeted back at Mahout at some point, but for reasons of
agility, I have been running it out of GitHub.

The salient point is that there are essentially no knobs that need turning
other than specifying a distance measure and possibly a large minimum
number of clusters.  The output is a clustering that can be searched
efficiently.

The key point is that this algorithm is

- single pass

- easily map-reducible

- fast

The third point deserves emphasis.  On my laptop, running in single-threaded
mode, this code can cluster 1,000,000 points in 20 dimensions into
1000 clusters in about a minute.

See the StreamingKmeans class at https://github.com/tdunning/knn for more
info.  The algorithm is based loosely on

http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf
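
For anyone who wants a feel for what a single-pass clustering of this kind
looks like, here is a minimal sketch.  It is not the StreamingKmeans class
from the repository and it does not use the Mahout math package.  The class
name StreamingSketch, the Euclidean distance, the brute-force nearest-centroid
scan, the cutoff growth factor of 1.5, and ignoring weights in the coin flip
are all assumptions made for illustration.  The idea shown is only the general
one: each point either merges into its nearest centroid or, with probability
proportional to its distance, starts a new one; when there are too many
centroids, the distance cutoff is raised and the centroids themselves are
re-clustered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative single-pass clustering sketch (hypothetical, not the repository code). */
class StreamingSketch {
    /** A weighted centroid that keeps the running mean of the points merged into it. */
    static final class Centroid {
        final double[] mean;
        double weight;

        Centroid(double[] p, double w) { mean = p.clone(); weight = w; }

        void add(double[] p, double w) {
            weight += w;
            for (int i = 0; i < mean.length; i++) {
                mean[i] += (p[i] - mean[i]) * w / weight;   // weighted running mean
            }
        }
    }

    private final int maxClusters;
    private final Random rand;
    private double cutoff;                        // distance scale for starting a new centroid
    private List<Centroid> centroids = new ArrayList<>();

    StreamingSketch(int maxClusters, double initialCutoff, long seed) {
        if (maxClusters < 1 || initialCutoff <= 0) {
            throw new IllegalArgumentException("need maxClusters >= 1 and initialCutoff > 0");
        }
        this.maxClusters = maxClusters;
        this.cutoff = initialCutoff;
        this.rand = new Random(seed);
    }

    /** Assumed distance measure: plain Euclidean. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Process one point; this is the only pass over the data. */
    void add(double[] p) {
        insert(p, 1.0);
        while (centroids.size() > maxClusters) {
            // Too many centroids: loosen the cutoff and re-cluster the centroids themselves.
            cutoff *= 1.5;
            List<Centroid> old = centroids;
            centroids = new ArrayList<>();
            for (Centroid c : old) {
                insert(c.mean, c.weight);
            }
        }
    }

    private void insert(double[] p, double w) {
        Centroid nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {            // brute-force scan, for simplicity only
            double d = distance(c.mean, p);
            if (d < best) { best = d; nearest = c; }
        }
        // Far-away points are likely to start a new centroid; nearby points get merged.
        if (nearest == null || rand.nextDouble() < best / cutoff) {
            centroids.add(new Centroid(p, w));
        } else {
            nearest.add(p, w);
        }
    }

    List<Centroid> centroids() { return centroids; }

    public static void main(String[] args) {
        StreamingSketch s = new StreamingSketch(20, 0.01, 42L);
        Random data = new Random(7);
        for (int i = 0; i < 10000; i++) {
            s.add(new double[] {data.nextGaussian(), data.nextGaussian()});
        }
        System.out.println("centroids kept: " + s.centroids().size());
    }
}
```

The only inputs in this sketch are a distance function and a cap on the
number of centroids.  Because each centroid carries a weight, centroid sets
built on separate chunks of data could in principle be fed through the same
insert step to combine them, which is one way to read the "map-reducible"
claim; how the actual code does it is best checked in the repository.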


This code does not yet use the Mahout clustering API conventions, but is
based entirely on the Mahout math package.

Kibitzers welcome.
