For the project, I've been doing some reading on Machine Learning in general (for example, there's a neat Caltech course called Learning from Data that I'm following to get a grip on the theoretical basics).
I decided to read the clustering chapter in that book I pointed you to, since it deals specifically with large-scale clustering. I was intrigued to find a description of the Bradley, Fayyad, Reina (BFR) algorithm, which is also designed to cluster data in Euclidean spaces, in shards. For each chunk of data, it picks k cluster centroids (by maximizing the distance between the selected points, or some other technique). New points are folded into these existing clusters as long as their Mahalanobis distance is below a certain threshold (which amounts to the point being within threshold standard deviations of the centroid). The points that remain are grouped into smaller miniclusters under less stringent constraints, using something like hierarchical clustering, and any singleton clusters left over are treated as outliers.

I was thinking this could just as well be done in the Mappers, and the resulting centroids (with the outliers and miniclusters folded into the cluster whose centroid they're closest to) could be output to the Reducer as the same kind of rough clustering we're using Streaming KMeans for.

My question is: why did we pick streaming k-means in particular as opposed to this algorithm? BFR seems like a decent candidate for the mapper-side clustering, and while it looks more complex (algorithmically), I wonder how its clustering quality compares to streaming k-means. What are your thoughts on this?
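Just to make the assignment step concrete, here's a minimal sketch of what I have in mind for the Mapper side, assuming each cluster keeps the usual N / SUM / SUMSQ sufficient statistics. The class and function names and the threshold of 3 standard deviations are my own placeholders, not anything taken from the book or our code:

```python
import numpy as np

class ClusterStats:
    """Per-cluster sufficient statistics: count, sum, and sum of squares."""
    def __init__(self, point):
        self.n = 1
        self.sum = np.array(point, dtype=float)
        self.sumsq = self.sum ** 2

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Per-dimension variance recovered from the sufficient statistics.
        return self.sumsq / self.n - self.centroid() ** 2

def mahalanobis(point, cluster, eps=1e-9):
    # Diagonal-covariance Mahalanobis distance, as used in BFR.
    d = point - cluster.centroid()
    return np.sqrt(np.sum(d ** 2 / (cluster.variance() + eps)))

def assign(point, clusters, threshold=3.0):
    """Fold the point into the nearest cluster if it lies within
    `threshold` standard deviations of its centroid; otherwise return
    None so the caller can hold it back for mini-clustering / outliers."""
    point = np.asarray(point, dtype=float)
    best = min(clusters, key=lambda c: mahalanobis(point, c))
    if mahalanobis(point, best) < threshold:
        best.add(point)
        return best
    return None
```

The points that come back as None would then go through the looser mini-clustering pass, and whatever is still a singleton after that gets flagged as an outlier before the centroids are emitted to the Reducer.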
