some new clustering code

Ted Dunning Wed, 04 Apr 2012 15:26:21 -0700

I have some new clustering code that I have been working on.  It will
probably be targeted back at Mahout at some point, but for reasons of
agility, I have been running it out of GitHub.


The salient point is that there are essentially no knobs that need turning
other than specifying a distance measure and possibly a large minimum
number of clusters.  The output is a clustering that can be searched
efficiently.

The key properties of this algorithm are that it is

- single pass

- easily map-reducible

- fast

The third point deserves emphasis.  On my laptop, running single-threaded,
this code clusters 1,000,000 points in 20 dimensions into 1000 clusters in
about a minute.
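
For a feel of what "single pass" and "easily map-reducible" can mean here,
the following is a rough Python sketch of a streaming clusterer.  It is not
the code from the repository and makes no attempt to match its speed.  The
names streaming_cluster, cluster_sharded, cutoff and beta are invented for
illustration, and the rule used (start a new centroid with probability
proportional to weight times distance over a cutoff, otherwise merge into
the nearest centroid, and raise the cutoff when there are too many
centroids) is a simplified assumption about this family of algorithms.

```python
import math
import random


def euclidean(a, b):
    """Default distance measure; any function of two vectors can be swapped in."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _insert(centroids, vec, w, cutoff, distance, rng):
    """Add one weighted point: either start a new centroid or merge into the nearest."""
    if not centroids:
        centroids.append([vec, w])
        return
    # Brute-force nearest-centroid search (a real implementation would use
    # a faster search structure).
    d, nearest = min(((distance(c[0], vec), c) for c in centroids),
                     key=lambda t: t[0])
    if rng.random() < min(1.0, w * d / cutoff):
        # Far away relative to the cutoff: likely to become a new centroid.
        centroids.append([vec, w])
    else:
        # Otherwise fold the point into the nearest centroid (weighted mean).
        total = nearest[1] + w
        nearest[0] = [(cx * nearest[1] + x * w) / total
                      for cx, x in zip(nearest[0], vec)]
        nearest[1] = total


def streaming_cluster(points, max_clusters, distance=euclidean,
                      cutoff=1e-3, beta=1.5, rng=None):
    """Single pass over (vector, weight) pairs; returns [(centroid, weight), ...]."""
    if max_clusters < 1:
        raise ValueError("max_clusters must be at least 1")
    rng = rng or random.Random(0)
    centroids = []
    for vec, w in points:
        _insert(centroids, list(vec), w, cutoff, distance, rng)
        while len(centroids) > max_clusters:
            # Too many centroids: loosen the cutoff and re-cluster the
            # centroids themselves, which is cheap because there are few.
            cutoff *= beta
            old, centroids = centroids, []
            rng.shuffle(old)
            for c, cw in old:
                _insert(centroids, c, cw, cutoff, distance, rng)
    return [(tuple(c), w) for c, w in centroids]


def cluster_sharded(shards, max_clusters, **kwargs):
    """Map-reduce style: cluster each shard, then cluster the weighted centroids."""
    partial = []
    for shard in shards:  # "map": each shard is processed independently
        partial.extend(
            streaming_cluster(((p, 1.0) for p in shard), max_clusters, **kwargs))
    # "reduce": the weighted centroids are just another, much smaller, stream
    return streaming_cluster(partial, max_clusters, **kwargs)
```

Calling streaming_cluster(((p, 1.0) for p in data), 1000) reads the data
once and returns at most 1000 weighted centroids.  Because the output has
the same (vector, weight) form as the input, partial results from separate
shards can be fed back through the same routine, which is what
cluster_sharded does.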

See the StreamingKmeans class at https://github.com/tdunning/knn for more
info.  The algorithm is based loosely on

http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf


This code does not yet use the Mahout clustering API conventions, but is
based entirely on the Mahout math package.

Kibitzers welcome.
