That's not what I get from the paper. Certainly, the cluster center is
the first representative point. But the paper talks about subsequently
iterating through the clustered points to find the farthest point from
the previously-selected representative points (RPs) and then adding that
as another representative point. After a few such iterations, a set of
RPs is developed for each cluster that defines the extreme points
observed within the cluster. This is especially useful for non-spherical
clusters, such as those returned by mean shift and by Dirichlet clustering
with asymmetric models. Then, in the final stage, the RPs of each pair of
clusters are compared, and the closest RPs between clusters are used to
compute CDbw. The final calculation can
be done in memory since the number of clusters and RPs is well-bounded
by then.
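For concreteness, here is a minimal in-memory sketch of the farthest-point
selection for a single cluster. The DistanceMeasure interface below is a
stand-in patterned after Mahout's pluggable measures, and all names here are
illustrative rather than actual Mahout classes:

import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for a pluggable distance measure; any metric would do. */
interface DistanceMeasure {
  double distance(double[] a, double[] b);
}

class RepresentativePoints {

  /**
   * Selects numRps representative points for one cluster: the center first,
   * then repeatedly the clustered point whose minimum distance to the
   * already-selected RPs is largest (the farthest-point heuristic).
   */
  static List<double[]> select(double[] center, List<double[]> clusteredPoints,
                               int numRps, DistanceMeasure measure) {
    List<double[]> rps = new ArrayList<>();
    rps.add(center);
    while (rps.size() < numRps) {
      double[] farthest = null;
      double bestDistance = -1.0;
      for (double[] p : clusteredPoints) {
        // Distance from p to its nearest already-selected RP.
        double dMin = Double.MAX_VALUE;
        for (double[] rp : rps) {
          dMin = Math.min(dMin, measure.distance(p, rp));
        }
        if (dMin > bestDistance) {
          bestDistance = dMin;
          farthest = p;
        }
      }
      if (farthest == null) {
        break; // fewer distinct points than requested RPs
      }
      rps.add(farthest);
    }
    return rps;
  }
}

Each pass over clusteredPoints adds one RP, which is exactly the per-iteration
work the MR jobs below would distribute.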
I get that each RP iteration takes place over all of the clustered
points and would require a new MR job for each iteration. I imagine
initializing the mappers and reducers with the set of clusters and their
RPs. Then each mapper processes a subset of all clustered points,
finally outputting the farthest point it has seen for each cluster. The
reducer gets these candidates and selects the overall most distant one
as the new RP, outputting it with the clusters+RPs for the next
iteration. This is a lot like the way Dirichlet works now, outputting
state to be used for the next iteration over the entire point set. We
would need to allow a DistanceMeasure to be specified for this phase.
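Here is a rough sketch of that mapper/reducer split, with the Hadoop plumbing
omitted and the DistanceMeasure stand-in from the sketch above reused. A real
job would load the clusters+RPs in the mappers' setup() from the previous
iteration's output:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A candidate RP paired with its distance to the nearest existing RP. */
class Candidate {
  final double[] point;
  final double distanceToNearestRp;

  Candidate(double[] point, double distanceToNearestRp) {
    this.point = point;
    this.distanceToNearestRp = distanceToNearestRp;
  }
}

/**
 * Mapper-side accumulator: over its split of the clustered points, keeps
 * only the point per cluster that is farthest from that cluster's RPs.
 */
class FarthestPointAccumulator {
  private final Map<Integer, List<double[]>> rpsByCluster; // loaded at setup()
  private final DistanceMeasure measure;
  private final Map<Integer, Candidate> best = new HashMap<>();

  FarthestPointAccumulator(Map<Integer, List<double[]>> rpsByCluster,
                           DistanceMeasure measure) {
    this.rpsByCluster = rpsByCluster;
    this.measure = measure;
  }

  /** map(): consider one clustered point from this mapper's split. */
  void map(int clusterId, double[] point) {
    double dMin = Double.MAX_VALUE;
    for (double[] rp : rpsByCluster.get(clusterId)) {
      dMin = Math.min(dMin, measure.distance(point, rp));
    }
    Candidate current = best.get(clusterId);
    if (current == null || dMin > current.distanceToNearestRp) {
      best.put(clusterId, new Candidate(point, dMin));
    }
  }

  /** close(): one candidate per cluster is emitted to the reducer. */
  Map<Integer, Candidate> emit() {
    return best;
  }

  /** Reducer side: of all mappers' candidates, keep the most distant. */
  static Candidate reduce(Iterable<Candidate> candidates) {
    Candidate winner = null;
    for (Candidate c : candidates) {
      if (winner == null || c.distanceToNearestRp > winner.distanceToNearestRp) {
        winner = c;
      }
    }
    return winner; // appended to the cluster's RPs for the next iteration
  }
}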
Currently, only canopy and kMeans actually produce their clustered
points. Dirichlet points could be clustered by assigning each point to
the model with the largest pdf (or even to more than one based upon a
user-settable pdf threshold). Fuzzy kMeans would need to make similar
assignments. MeanShift point ids are currently retained in its cluster
state, but there is no step that builds clustered points the way canopy
and kMeans do. Some work would be needed here too, as we need a uniform
representation for clustered points.
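A sketch of that Dirichlet assignment step might look like the following,
where Model is a hypothetical stand-in for a Dirichlet cluster model
exposing its density function:

import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Dirichlet cluster model's density. */
interface Model {
  double pdf(double[] point);
}

class DirichletAssignment {

  /** Hard assignment: the index of the model with the largest pdf. */
  static int assign(double[] point, List<Model> models) {
    int best = -1;
    double bestPdf = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < models.size(); i++) {
      double p = models.get(i).pdf(point);
      if (p > bestPdf) {
        bestPdf = p;
        best = i;
      }
    }
    return best;
  }

  /** Soft assignment: every model whose pdf clears a user-settable threshold. */
  static List<Integer> assignAbove(double[] point, List<Model> models,
                                   double threshold) {
    List<Integer> assigned = new ArrayList<>();
    for (int i = 0; i < models.size(); i++) {
      if (models.get(i).pdf(point) >= threshold) {
        assigned.add(i);
      }
    }
    return assigned;
  }
}

Fuzzy kMeans could make the same kind of assignment using its membership
weights in place of the pdf.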
Finally, I'd like to review the output file naming conventions across
all the clustering algorithms and converge on a single nomenclature
common to all jobs.
Robin Anil wrote:
The cluster center itself is a representative point. One pass over the data
will get us points that are close enough. Or, exhaustively, we could just add
it in the KMeans Mapper and update a counter, maybe?
Robin