I have some new clustering code that I have been working on. It will probably be targeted back at Mahout at some point, but for reasons of agility, I have been running it out of GitHub.
The salient point is that there are essentially no knobs that need turning other than specifying a distance measure and possibly a large minimum number of clusters. The output is a clustering that can be searched efficiently.

The key point is that this algorithm is

- single pass
- easily map-reducible
- fast

The third point is a salient one. On my laptop running in single-threaded mode, this code is able to cluster 1,000,000 points in 20 dimensions into 1000 clusters in about a minute.

See the StreamingKmeans class at https://github.com/tdunning/knn for more info.

The algorithm is based loosely on http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf

This code does not yet use the Mahout clustering API conventions, but is based entirely on the Mahout math package.

Kibitzers welcome.
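
For anyone who wants a feel for what "single pass" means here, the following is a minimal Python sketch of the general streaming idea. It is not the StreamingKmeans class from the repository and does not use the Mahout math package. The function names, the `max_clusters` cap, the initial `cutoff`, and the growth factor `beta` are all assumptions made for illustration, and details such as the exact probability rule may well differ from both the code and the paper.

```python
import math
import random


def euclidean(a, b):
    """Plain Euclidean distance; any other distance function can be passed in."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _absorb(centroids, weight, vec, cutoff, distance, rng):
    """Add one weighted point: merge it into the nearest centroid or start a new one."""
    if centroids:
        nearest = min(centroids, key=lambda c: distance(c[1], vec))
        d = distance(nearest[1], vec)
        # Points far from every centroid are likely to seed a new cluster;
        # points close to one are likely to be merged into it.
        if rng.random() >= min(1.0, d / cutoff):
            total = nearest[0] + weight
            nearest[1] = [(nearest[0] * x + weight * y) / total
                          for x, y in zip(nearest[1], vec)]
            nearest[0] = total
            return
    centroids.append([weight, list(vec)])


def streaming_kmeans(points, max_clusters, distance=euclidean,
                     cutoff=1e-3, beta=1.5, seed=0):
    """One pass over `points`; returns a list of [weight, centroid] pairs.

    `max_clusters`, `cutoff` and `beta` are hypothetical parameters of this
    sketch, not options of the real implementation.
    """
    rng = random.Random(seed)
    centroids = []
    for p in points:
        _absorb(centroids, 1.0, p, cutoff, distance, rng)
        while len(centroids) > max_clusters:
            # Too many clusters: loosen the threshold and re-cluster the
            # centroids themselves, treating each one as a weighted point.
            cutoff *= beta
            old, centroids = centroids, []
            for w, v in old:
                _absorb(centroids, w, v, cutoff, distance, rng)
    return centroids
```

Each point is looked at once, and the only state kept is the list of weighted centroids, which is why independent chunks of data could be sketched separately and their centroids fed through the same routine afterwards. The nearest-centroid lookup in this sketch is a linear scan, so it will be nowhere near as fast as the timing quoted above.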
