Hmmm....
I think that computing centroids in the mapper may not be the best idea. A different structure that would work well is to use the mapper to assign data records to centroids and use the centroid number as the reduce key. Then the reducer itself can compute the centroids. You can read the old centroids from HDFS in the configure method of the mapper. Lather, rinse, repeat.

This process avoids moving large amounts of data through the configuration process. It can also be extended to more advanced approaches such as Gaussian mixtures by emitting each input record multiple times, with multiple centroid keys and a strength of association for each.

Computing centroids in the mapper works well in that it minimizes the amount of data that is passed to the reducers, but it critically depends on the availability of sufficient statistics for computing cluster centroids. This works fine for Gaussian clusters (as in k-means), but there are other mixture models that require fancier updates than this.

Computing centroids in the reducer allows you to avoid your problem with the output collector. If sufficient statistics like sums (for means) are available, then you can use a combiner to do the reduction incrementally and avoid moving too much data around. The reducer will still have to accumulate these partial updates for final output, but it won't have much left to compute.

All of this is completely analogous to word-counting, actually. You don't accumulate counts in the mapper; you accumulate partial sums in the combiner and final sums in the reducer.

On 2/9/08 4:21 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

> Thanks Aaron, I missed that one. Now I have my configuration information
> in my mapper. In the mapper, I'm computing cluster centroids by reading
> all the input points and assigning them to clusters. I don't actually
> store the points in the mapper, just the evolving centroids.
>
> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available. Is there a way to do
> this, or do I need to backtrack?
>
> Jeff
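
For concreteness, here is a minimal sketch of the data flow described in the reply: the mapper assigns each record to its nearest centroid, the combiner adds up partial sums, and the reducer finishes the mean. It is plain Python that simulates the flow in memory, not Hadoop code. The function names (nearest, map_record, merge, kmeans_iteration), the way splits are modeled as lists, and the sample points are all made up for illustration.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))

def map_record(point, centroids):
    """Mapper: emit (centroid index, (vector sum, count)) for one record."""
    return nearest(point, centroids), (tuple(point), 1)

def merge(values):
    """Combiner and reducer both use this: add up partial sums and counts."""
    total, count = None, 0
    for vec, n in values:
        total = vec if total is None else tuple(a + b for a, b in zip(total, vec))
        count += n
    return total, count

def kmeans_iteration(splits, centroids):
    """One pass: map and combine within each input split, then reduce per key."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for point in split:
            key, value = map_record(point, centroids)
            local[key].append(value)
        # Combiner: at most one partial (sum, count) per key leaves each split.
        for key, values in local.items():
            shuffled[key].append(merge(values))
    # Assumption of this sketch: a centroid that received no points stays put.
    new_centroids = list(centroids)
    # Reducer: accumulate the partials, then divide to get the new centroid.
    for key, partials in shuffled.items():
        total, count = merge(partials)
        new_centroids[key] = tuple(x / count for x in total)
    return new_centroids

# Hypothetical data: two input splits and two starting centroids.
splits = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0), (2.0, 0.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(5):  # lather, rinse, repeat
    centroids = kmeans_iteration(splits, centroids)
```

Because the (sum, count) pair is a sufficient statistic for the mean, the same merge function serves as both combiner and reducer, exactly as in word counting. A model whose update cannot be written as such a sum would have to ship the raw records to the reducer instead.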
