Hmmmm indeed, this is certainly food for thought. I'm cross-posting this to Mahout since it bears upon my recent submission there. Here's what that submission does, and how I think I can incorporate these ideas into it.
Each canopy mapper sees only a subset of the points. It goes ahead and assigns them to canopies based upon the distance measure and thresholds. Once it is done, in close(), it computes and outputs the canopy centroids to the reducer using a constant key. The canopy reducer sees the entire set of centroids and clusters them again into the final canopy centroids, which are output. That set of centroids is then loaded into all clustering mappers, during configure(), for the final clustering.

Thinking about your suggestion: if the canopy mapper only maintains canopy centers, and outputs each point keyed by its canopyCenterId (perhaps multiple times, if a point is covered by more than one canopy) to a combiner, and if the combiner then sums all of its points to compute the centroid for output to the canopy reducer, then I won't have to output anything during close(). Writing output in close() seems to work, but it has never felt right; using a combiner in this manner would avoid it entirely. Did I get it? <grin>

Jeff

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 09, 2008 7:07 PM
To: [EMAIL PROTECTED]
Subject: Re: Best Practice?

Hmmm.... I think that computing centroids in the mapper may not be the best idea. A different structure that would work well is to use the mapper to assign data records to centroids, using the centroid number as the reduce key. The reduce itself can then compute the centroids. You can read the old centroids from HDFS in the configure method of the mapper. Lather, rinse, repeat. This process avoids moving large amounts of data through the configuration process.

This method can be extended to more advanced approaches such as Gaussian mixtures by emitting each input record multiple times, with multiple centroid keys and a strength of association.

Computing centroids in the mapper works well in that it minimizes the amount of data passed to the reducers, but it critically depends on the availability of a sufficient statistic for computing cluster centroids. This works fine for Gaussian processes (aka k-means), but there are other mixture models that require fancier updates.

Computing centroids in the reducer allows you to avoid your problem with the output collector. If sufficient statistics like sums (means) are available, then you can use a combiner to do the reduction incrementally and avoid moving too much data around. The reducer will still have to accumulate these partial updates for the final output, but it won't have much left to compute.

All of this is completely analogous to word counting, actually. You don't accumulate counts in the mapper; you accumulate partial sums in the combiner and final sums in the reducer.

On 2/9/08 4:21 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

> Thanks Aaron, I missed that one. Now I have my configuration information
> in my mapper. In the mapper, I'm computing cluster centroids by reading
> all the input points and assigning them to clusters. I don't actually
> store the points in the mapper, just the evolving centroids.
>
> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available. Is there a way to do
> this, or do I need to backtrack?
>
> Jeff
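
For concreteness, here is a rough sketch of the combiner-based structure described above. This is only my own illustration, not the actual Mahout code: it assumes the old org.apache.hadoop.mapred API, points carried as comma-separated doubles in Text values, and prior canopy centers loaded during configure() (reading them from a side file in HDFS is omitted). All class names and the canopy.t1 property are made up.

// Sketch only -- assumptions as described in the paragraph above.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CanopyCentroidSketch {

  // "count,x1,x2,..." <-> double[], element 0 holding the count
  static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
    return v;
  }

  static String format(double[] v) {
    StringBuilder sb = new StringBuilder().append(v[0]);
    for (int i = 1; i < v.length; i++) sb.append(',').append(v[i]);
    return sb.toString();
  }

  static double distance(double[] a, double[] b) {
    double d = 0;
    for (int i = 0; i < b.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(d);
  }

  // Adds up the (count, coordinate sums) records arriving for one canopy key.
  static double[] accumulate(Iterator<Text> values) {
    double[] total = parse(values.next().toString());
    while (values.hasNext()) {
      double[] p = parse(values.next().toString());
      for (int i = 0; i < total.length; i++) total[i] += p[i];
    }
    return total;
  }

  // Mapper: assigns each point to every canopy that covers it and emits
  // (canopyId, "1,x1,x2,...") immediately -- nothing is held back for close().
  public static class AssignMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final List<double[]> canopyCenters = new ArrayList<double[]>();
    private double t1; // outer distance threshold

    public void configure(JobConf job) {
      t1 = Double.parseDouble(job.get("canopy.t1", "3.0"));
      // load the prior canopy centers into canopyCenters here (omitted)
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      double[] point = parse(value.toString());
      for (int i = 0; i < canopyCenters.size(); i++) {
        if (distance(point, canopyCenters.get(i)) < t1) {
          output.collect(new Text("canopy-" + i), new Text("1," + value.toString()));
        }
      }
    }
  }

  // Combiner: folds many (1, point) records into one (count, sums) record.
  public static class PartialSumCombiner extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      output.collect(key, new Text(format(accumulate(values))));
    }
  }

  // Reducer: adds the remaining partial sums and divides by the count to get
  // the new centroid for each canopy.
  public static class CentroidReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      double[] total = accumulate(values);
      double[] centroid = new double[total.length - 1];
      for (int i = 0; i < centroid.length; i++) centroid[i] = total[i + 1] / total[0];
      output.collect(key, new Text(format(centroid)));
    }
  }
}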

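The reason the combiner is safe here is that the mean only needs the pair (count, sum) as its sufficient statistic, and those pairs add up associatively. With made-up numbers: if one map task's combiner emits (count=3, sum=12.0) for a canopy and another emits (count=2, sum=6.0), the reducer just adds them to get (5, 18.0) and outputs the centroid 18.0 / 5 = 3.6, exactly as if it had seen all five points; just as in word counting, the partial sums combine without loss.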