On Jun 23, 2009, at 11:18 AM, Jeff Eastman wrote:
That makes sense, though I don't understand why the reducer is not doing its job in the test you cite. I've had to do manual things (like calling close() in the unit tests) to get all of the functionality to exercise. All of the clustering algorithms behave similarly: each cluster has a center (prior) which is used to observe some of the data (observations) based upon a distance function (pdf), and those observations are used to compute its new centroid (posterior). I think it is possible to abstract them into a common framework using this model.
It makes sense b/c the M/R pieces rely on everything round-tripping through the serialization/deserialization phase, whereas that particular test does not do that. The centroid from one iteration thus becomes the center for the next iteration, AFAICT.
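
To make the center/observe/centroid model concrete, here is a rough sketch in Java. None of these names (ClusterSketch, Cluster, Distance, iterate) come from the actual code base; they are hypothetical and only illustrate the idea. It assumes hard assignment (each point goes to the single nearest center, as in k-means); an algorithm with soft assignment would weight each observation by its pdf value instead.

```java
import java.util.List;

/** Hypothetical sketch of the common clustering model; not an existing API. */
final class ClusterSketch {

    /** Stand-in for the distance function (pdf) used to assign points. */
    interface Distance {
        double between(double[] a, double[] b);
    }

    /** A cluster: a center (prior) plus running totals of observed points. */
    static final class Cluster {
        double[] center;
        private double[] sum;
        private int count;

        Cluster(double[] center) {
            this.center = center.clone();
            this.sum = new double[center.length];
        }

        /** Record one observation assigned to this cluster. */
        void observe(double[] point) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }

        /** Posterior: mean of the observations, or the old center if none. */
        double[] computeCentroid() {
            if (count == 0) {
                return center.clone();
            }
            double[] centroid = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                centroid[i] = sum[i] / count;
            }
            return centroid;
        }

        /** The centroid becomes the center for the next iteration. */
        void advance() {
            center = computeCentroid();
            sum = new double[center.length];
            count = 0;
        }
    }

    /** One iteration: assign each point to its nearest center, then update. */
    static void iterate(List<Cluster> clusters, List<double[]> points, Distance d) {
        for (double[] p : points) {
            Cluster best = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (Cluster c : clusters) {
                double dist = d.between(c.center, p);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            if (best != null) {
                best.observe(p);
            }
        }
        for (Cluster c : clusters) {
            c.advance();
        }
    }

    /** Plain Euclidean distance, usable as ClusterSketch::euclidean. */
    static double euclidean(double[] a, double[] b) {
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            total += diff * diff;
        }
        return Math.sqrt(total);
    }
}
```

Calling ClusterSketch.iterate(clusters, points, ClusterSketch::euclidean) repeatedly would run successive iterations. In this sketch the update happens in memory, so it does not exercise the serialization round trip that the M/R version depends on.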
