For the project, I've been doing some reading on Machine Learning in general (for example, there's a neat Caltech course called Learning from Data that I'm following to get a grip on the theoretical basics).
I decided to read the clustering chapter in that book I pointed you to, since it deals specifically with large-scale clustering. I was intrigued to find a description of the Bradley, Fayyad, Reina (BFR) algorithm, which is also designed to cluster data in Euclidean spaces, in shards. For each chunk of data, it picks k cluster centroids (by maximizing the distance between the selected points, or some other technique). New points are folded into these existing clusters as long as their Mahalanobis distance is below a certain threshold (which amounts to the point being within threshold standard deviations of the centroid). The points that remain are grouped into smaller miniclusters under less stringent constraints, using something like hierarchical clustering, and any singleton clusters left over are treated as outliers.

I was thinking this could just as well be done in the Mappers, and the resulting centroids (with the outliers and miniclusters folded into the cluster whose centroid they're closest to) could be output to the Reducer as the same kind of rough clustering we're using Streaming KMeans for.

My question is: why did we pick streaming k-means in particular as opposed to this algorithm? BFR seems like a decent candidate for the mapper-side clustering, and while it looks more complex (algorithmically), I wonder how its clustering quality compares to streaming k-means. What are your thoughts on this?
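Just to make the assignment step concrete, here's a minimal sketch of what I have in mind for the Mapper side, assuming each cluster keeps the usual N / SUM / SUMSQ sufficient statistics. The class and function names and the threshold of 3 standard deviations are my own placeholders, not anything taken from the book or our code:

```python
import numpy as np

class ClusterStats:
    """Per-cluster sufficient statistics: count, sum, and sum of squares."""
    def __init__(self, point):
        self.n = 1
        self.sum = np.array(point, dtype=float)
        self.sumsq = self.sum ** 2

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Per-dimension variance recovered from the sufficient statistics.
        return self.sumsq / self.n - self.centroid() ** 2

def mahalanobis(point, cluster, eps=1e-9):
    # Diagonal-covariance Mahalanobis distance, as used in BFR.
    d = point - cluster.centroid()
    return np.sqrt(np.sum(d ** 2 / (cluster.variance() + eps)))

def assign(point, clusters, threshold=3.0):
    """Fold the point into the nearest cluster if it lies within
    `threshold` standard deviations of its centroid; otherwise return
    None so the caller can hold it back for mini-clustering / outliers."""
    point = np.asarray(point, dtype=float)
    best = min(clusters, key=lambda c: mahalanobis(point, c))
    if mahalanobis(point, best) < threshold:
        best.add(point)
        return best
    return None
```

The points that come back as None would then go through the looser mini-clustering pass, and whatever is still a singleton after that gets flagged as an outlier before the centroids are emitted to the Reducer.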
