That's not what I get from the paper. Certainly, the cluster center is the first representative point. But the paper talks about subsequently iterating through the clustered points to find the farthest point from the previously selected representative points (RPs) and then adding that as another representative point. After a few such iterations, a set of RPs is developed for each cluster that defines the extreme points observed within the cluster. This is especially useful for non-spherical clusters, such as those returned by mean shift and Dirichlet asymmetric models. Then, in the final stage, the RPs in each cluster are compared and the closest RPs are used to compute CDbw. The final calculation can be done in memory since the number of clusters and RPs is well-bounded by then.
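To make the selection loop concrete, here is a minimal in-memory sketch of the farthest-point iteration described above. This is a hypothetical illustration only (not the Mahout implementation): it assumes Euclidean distance and dense double[] points, and all class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory sketch of farthest-point RP selection; the real
// job would distribute the scan over the clustered points via MapReduce.
public class RepPointSketch {

    static double dist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Distance from a point to its nearest already-selected RP.
    static double minDistToReps(double[] p, List<double[]> reps) {
        double min = Double.MAX_VALUE;
        for (double[] r : reps) {
            min = Math.min(min, dist(p, r));
        }
        return min;
    }

    // Start with the cluster center, then repeatedly add the clustered
    // point farthest from all previously selected representatives.
    static List<double[]> selectReps(double[] center, List<double[]> points, int numReps) {
        List<double[]> reps = new ArrayList<>();
        reps.add(center);
        while (reps.size() < numReps) {
            double[] best = null;
            double bestDist = -1.0;
            for (double[] p : points) {
                double d = minDistToReps(p, reps);
                if (d > bestDist) {
                    bestDist = d;
                    best = p;
                }
            }
            reps.add(best);
        }
        return reps;
    }
}
```

Because each new RP maximizes its distance to the whole current RP set, the selected points spread out along the cluster boundary, which is what makes the method work for non-spherical shapes.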

I get that each RP iteration takes place over all of the clustered points and would require a new MR job for each iteration. I imagine initializing the mappers and reducers with the set of clusters and their RPs. Then each mapper processes a subset of all clustered points, finally outputting the farthest point it has seen for each cluster. The reducer gets this information and selects the candidate that is most distant overall, outputting it along with the clusters and their RPs for the next iteration. This is a lot like the way Dirichlet works now, outputting state to be used for the next iteration over the entire point set. We would need to allow a DistanceMeasure to be specified for this phase.
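The per-iteration job described above might look like this in outline. Plain collections stand in for Hadoop's Mapper/Reducer machinery and writables, and every name here is hypothetical; it only shows the shape of the map and reduce steps for one RP iteration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical outline of one RP iteration as a map/reduce pass, with
// plain collections standing in for Hadoop mappers, reducers and writables.
public class FarthestPointPass {

    // A (distance, point) candidate emitted by a mapper for one cluster.
    static class Candidate {
        final double dist;
        final double[] point;
        Candidate(double dist, double[] point) { this.dist = dist; this.point = point; }
    }

    static double dist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Distance from a point to its nearest already-selected RP.
    static double minDistToReps(double[] p, List<double[]> reps) {
        double min = Double.MAX_VALUE;
        for (double[] r : reps) {
            min = Math.min(min, dist(p, r));
        }
        return min;
    }

    // "Map": over its split of clustered points, keep only the farthest
    // point seen so far for each cluster, given that cluster's current RPs.
    static Map<Integer, Candidate> map(Map<Integer, List<double[]>> split,
                                       Map<Integer, List<double[]>> clusterReps) {
        Map<Integer, Candidate> farthest = new HashMap<>();
        for (Map.Entry<Integer, List<double[]>> e : split.entrySet()) {
            int clusterId = e.getKey();
            for (double[] p : e.getValue()) {
                double d = minDistToReps(p, clusterReps.get(clusterId));
                Candidate cur = farthest.get(clusterId);
                if (cur == null || d > cur.dist) {
                    farthest.put(clusterId, new Candidate(d, p));
                }
            }
        }
        return farthest;
    }

    // "Reduce": merge all mapper outputs, keeping the globally farthest
    // candidate per cluster; that point becomes the cluster's next RP.
    static Map<Integer, Candidate> reduce(List<Map<Integer, Candidate>> mapperOutputs) {
        Map<Integer, Candidate> best = new HashMap<>();
        for (Map<Integer, Candidate> out : mapperOutputs) {
            for (Map.Entry<Integer, Candidate> e : out.entrySet()) {
                Candidate cur = best.get(e.getKey());
                if (cur == null || e.getValue().dist > cur.dist) {
                    best.put(e.getKey(), e.getValue());
                }
            }
        }
        return best;
    }
}
```

In the real job the hardcoded Euclidean dist() would be replaced by whatever DistanceMeasure the user specifies, and the reducer output (clusters plus their enlarged RP sets) would be written out as the state for the next iteration.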

Currently, only canopy and kMeans actually produce their clustered points. Dirichlet points could be clustered by assigning each point to the model with the largest pdf (or even to more than one based upon a user-settable pdf threshold). Fuzzy kMeans would need to make similar assignments. MeanShift point ids are currently retained in its cluster state but there is no step to build clustered points like canopy and kMeans do. Some work would be needed here too, as we need a uniform representation for clustered points.
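For the Dirichlet (and Fuzzy kMeans) case, the max-pdf and thresholded assignments could be sketched as below. The Model interface and all names are placeholders for illustration, not Mahout's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of clustering Dirichlet output points by pdf:
// hard-assign each point to the model with the largest pdf, or
// soft-assign it to every model whose pdf clears a threshold.
public class PdfAssignment {

    // Placeholder for a cluster model that can evaluate a density at a point.
    interface Model {
        double pdf(double[] x);
    }

    // Hard assignment: index of the model with the largest pdf for x.
    static int assignToMax(double[] x, List<Model> models) {
        int best = -1;
        double bestPdf = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < models.size(); i++) {
            double p = models.get(i).pdf(x);
            if (p > bestPdf) {
                bestPdf = p;
                best = i;
            }
        }
        return best;
    }

    // Soft assignment: indices of all models whose pdf meets a user-settable
    // threshold, so a point may be clustered into more than one model.
    static List<Integer> assignAboveThreshold(double[] x, List<Model> models, double threshold) {
        List<Integer> assigned = new ArrayList<>();
        for (int i = 0; i < models.size(); i++) {
            if (models.get(i).pdf(x) >= threshold) {
                assigned.add(i);
            }
        }
        return assigned;
    }
}
```

Either method yields the uniform clustered-points representation mentioned above: a (clusterId, point) pair per assignment, which is the same shape canopy and kMeans already emit.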

Finally, I'd like to review the output file naming conventions across all the clustering algorithms and converge on a single nomenclature that is common across all jobs.

Robin Anil wrote:
The cluster center itself is a representative point. One pass over the data will
get us points that are close enough. Or, exhaustively, we could just add it in the
Kmeans Mapper and maybe update a counter?

Robin

