Hmmmm indeed, this is certainly food for thought. I'm cross-posting this
to Mahout since it bears upon my recent submission there. Here's what
that submission does, and how I think I can incorporate these ideas into
it.

Each canopy mapper sees only a subset of the points. It goes ahead and
assigns them to canopies based upon the distance measure and thresholds.
Once it is done, in close(), it computes and outputs the canopy
centroids to the reducer using a constant key.
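
Roughly, in code (a simplified sketch rather than the actual patch; Point and
Canopy here are stand-ins for the real classes and their text encodings):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Simplified sketch of the canopy mapper; Point and Canopy are placeholders.
public class CanopyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<Canopy> canopies = new ArrayList<Canopy>();
  private OutputCollector<Text, Text> collector; // saved so close() can emit

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    collector = output; // remember the collector for use in close()
    Point point = Point.decode(value.toString());
    Canopy.addPointToCanopies(point, canopies); // T1/T2 distance thresholds applied here
  }

  public void close() throws IOException {
    // Emit each local canopy centroid under a constant key so a single
    // reducer sees all of them (assumes map() was called at least once).
    for (Canopy canopy : canopies) {
      collector.collect(new Text("centroid"),
          new Text(canopy.computeCentroid().encode()));
    }
  }
}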

The canopy reducer sees the entire set of centroids, and clusters them
again into the final canopy centroids that are output. This set of
centroids will then be loaded into all clustering mappers, during
configure(), for the final clustering.
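
And the reducer side, again as a rough sketch (same placeholder classes; the
configure() snippet shows how the clustering mapper would pick the final
centroids back up, with a made-up property name and loader):

// Sketch of the canopy reducer: it re-clusters the mapper centroids into the
// final canopies using the same canopy logic.
public class CanopyReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    List<Canopy> canopies = new ArrayList<Canopy>();
    while (values.hasNext()) {
      Point centroid = Point.decode(values.next().toString());
      Canopy.addPointToCanopies(centroid, canopies);
    }
    for (Canopy canopy : canopies) {
      output.collect(new Text(canopy.getIdentifier()),
          new Text(canopy.computeCentroid().encode()));
    }
  }
}

// In the clustering mapper, the final centroids are loaded during configure();
// 'canopies' is the mapper's field that holds them.
public void configure(JobConf job) {
  canopies = Canopy.loadCanopies(job.get("canopy.path"), job);
}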

Thinking about your suggestion: if the canopy mapper only maintains
canopy centers and outputs each point keyed by its canopyCenterId
(perhaps multiple times, if a point is covered by more than one canopy),
and if a combiner then sums those points to compute the centroid it
sends on to the canopy reducer, then I won't have to output anything
during close(). Emitting from close() seems to work, but it has never
felt quite right; using a combiner in this manner would avoid it.
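
Sketching that variant (same placeholder classes; each point goes out as a
"1<TAB>point" partial so the combiner's output has the same shape as the
mapper's and everything downstream can just keep adding):

// The mapper maintains only the canopy centers and emits each point under
// every covering canopy's id.
private final List<Canopy> canopies = new ArrayList<Canopy>();

public void map(LongWritable key, Text value,
    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
  Point point = Point.decode(value.toString());
  Canopy.addPointToCanopies(point, canopies); // maintain centers only
  for (Canopy canopy : Canopy.coveringCanopies(point, canopies)) { // possibly > 1
    output.collect(new Text(canopy.getIdentifier()),
        new Text("1\t" + point.encode()));
  }
}

// The combiner and reducer then add up the counts and point sums per canopy id,
// and the reducer divides once at the end to produce the centroid. Emitting
// (count, sum) partials rather than centroids keeps the combining associative.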

Did I get it<grin>?
Jeff



-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 09, 2008 7:07 PM
To: [EMAIL PROTECTED]
Subject: Re: Best Practice?



Hmmm....

I think that computing centroids in the mapper may not be the best idea.

A different structure that would work well is to use the mapper to assign
data records to centroids, using the centroid number as the reduce key.
Then the reduce itself can compute the centroids. You can read the old
centroids from HDFS in the configure method of the mapper. Lather, rinse,
repeat.
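
In rough code, one iteration looks something like this (a sketch only; the
Point helper and the centroids.path property are made up for illustration):

// (imports omitted: java.util.*, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*)
// The mapper tags each record with the number of its nearest centroid; the
// reducer averages the records for each centroid. The driver re-runs the job
// with the new centroids until they stop moving.
public class KMeansMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private List<Point> centroids;

  public void configure(JobConf job) {
    // Read the previous iteration's centroids from HDFS.
    centroids = Point.readCentroids(job.get("centroids.path"), job);
  }

  public void map(LongWritable key, Text value,
      OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
    Point point = Point.decode(value.toString());
    int nearest = Point.nearestIndex(point, centroids);
    output.collect(new IntWritable(nearest), value); // centroid number is the reduce key
  }
}

public class KMeansReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, IntWritable, Text> {

  public void reduce(IntWritable centroidId, Iterator<Text> points,
      OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
    Point sum = null;
    int count = 0;
    while (points.hasNext()) {
      Point p = Point.decode(points.next().toString());
      sum = (sum == null) ? p : sum.plus(p);
      count++;
    }
    output.collect(centroidId, new Text(sum.divideBy(count).encode())); // new centroid
  }
}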

This process avoids moving large amounts of data through the
configuration process.

This method can be extended to more advanced approaches such as Gaussian
mixtures by emitting each input record multiple times with multiple
centroid keys and a strength of association.
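
For example, the map step becomes something like this (sketch; the model and
its responsibility computation are placeholders for whatever mixture you use):

// Soft assignment: emit the record once per mixture component together with
// its responsibility, so the reducer can form weighted sums per component.
private MixtureModel model; // loaded in configure(); placeholder type

public void map(LongWritable key, Text value,
    OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
  Point point = Point.decode(value.toString());
  double[] resp = model.responsibilities(point); // p(component | point)
  for (int k = 0; k < resp.length; k++) {
    if (resp[k] > 1e-6) { // skip negligible associations
      output.collect(new IntWritable(k),
          new Text(resp[k] + "\t" + point.encode()));
    }
  }
}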

Computing centroids in the mapper works well in that it minimizes the
amount of data passed to the reducers, but it critically depends on the
availability of sufficient statistics for computing cluster centroids
(for k-means, the per-cluster sum and count of the assigned points).
This works fine for Gaussian processes (aka k-means), but there are
other mixture models that require fancier updates than this.

Computing centroids in the reducer allows you to avoid your problem with
the output collector. If sufficient statistics like sums (means) are
available, then you can use a combiner to do the reduction incrementally
and avoid moving too much data around. The reducer will still have to
accumulate these partial updates for final output, but it won't have to
do very much of the computation itself.
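
Concretely, the combiner could look something like this (sketch; it assumes the
mapper emits each point as a "1<TAB>point" partial so that map output and
combiner output share one format):

// Partial-sum combiner: add up counts and point sums per centroid key; the
// reducer adds the partials the same way and divides once to get the centroid.
public class PartialSumCombiner extends MapReduceBase
    implements Reducer<IntWritable, Text, IntWritable, Text> {

  public void reduce(IntWritable centroidId, Iterator<Text> values,
      OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
    long count = 0;
    Point sum = null;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split("\t", 2); // "count<TAB>sum"
      count += Long.parseLong(parts[0]);
      Point partial = Point.decode(parts[1]);
      sum = (sum == null) ? partial : sum.plus(partial);
    }
    output.collect(centroidId, new Text(count + "\t" + sum.encode()));
  }
}
// wired in with conf.setCombinerClass(PartialSumCombiner.class)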

All of this is completely analogous to word-counting, actually. You
don't accumulate counts in the mapper; you accumulate partial sums in
the combiner and final sums in the reducer.
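
For the record, the word-count shape of it (sketch):

// The mapper emits (word, 1); the same reducer class serves as the combiner
// for partial sums and as the reducer for final sums.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable key, Text line,
      OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    StringTokenizer tok = new StringTokenizer(line.toString());
    while (tok.hasMoreTokens()) {
      output.collect(new Text(tok.nextToken()), ONE);
    }
  }
}

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text word, Iterator<LongWritable> counts,
      OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    long sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(word, new LongWritable(sum));
  }
}
// conf.setCombinerClass(WordCountReducer.class);
// conf.setReducerClass(WordCountReducer.class);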




On 2/9/08 4:21 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

> Thanks Aaron, I missed that one. Now I have my configuration
> information in my mapper. In the mapper, I'm computing cluster
> centroids by reading all the input points and assigning them to
> clusters. I don't actually store the points in the mapper, just the
> evolving centroids.
> 
> I'm trying to wait until close() to output the cluster centroids to
> the reducer, but the OutputCollector is not available. Is there a way
> to do this, or do I need to backtrack?
> 
> Jeff
