That's not what I get from the paper. Certainly, the cluster center is
the first representative point. But the paper talks about subsequently
iterating through the clustered points to find the farthest point from
the previously-selected representative points (RPs) and then adding that
as another representative point. After a few such iterations, a set of
RPs is developed for each cluster that defines the extreme points
observed within the cluster. This is especially useful for non-spherical
clusters, such as those returned by mean shift and by Dirichlet clustering
with asymmetric models. Then, in the final stage, the RPs of each pair of
clusters are compared, and the closest RPs between clusters are used to
compute CDbw. The final calculation can
be done in memory since the number of clusters and RPs is well-bounded
by then.
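For concreteness, here is a minimal in-memory sketch of the farthest-point
selection for a single cluster. The DistanceMeasure interface below is a
stand-in patterned after Mahout's pluggable measures, and all names here are
illustrative rather than actual Mahout classes:

import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for a pluggable distance measure; any metric would do. */
interface DistanceMeasure {
  double distance(double[] a, double[] b);
}

class RepresentativePoints {

  /**
   * Selects numRps representative points for one cluster: the center first,
   * then repeatedly the clustered point whose minimum distance to the
   * already-selected RPs is largest (the farthest-point heuristic).
   */
  static List<double[]> select(double[] center, List<double[]> clusteredPoints,
                               int numRps, DistanceMeasure measure) {
    List<double[]> rps = new ArrayList<>();
    rps.add(center);
    while (rps.size() < numRps) {
      double[] farthest = null;
      double bestDistance = -1.0;
      for (double[] p : clusteredPoints) {
        // Distance from p to its nearest already-selected RP.
        double dMin = Double.MAX_VALUE;
        for (double[] rp : rps) {
          dMin = Math.min(dMin, measure.distance(p, rp));
        }
        if (dMin > bestDistance) {
          bestDistance = dMin;
          farthest = p;
        }
      }
      if (farthest == null) {
        break; // fewer distinct points than requested RPs
      }
      rps.add(farthest);
    }
    return rps;
  }
}

Each pass over clusteredPoints adds one RP, which is exactly the per-iteration
work the MR jobs below would distribute.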
I get that each RP iteration takes place over all of the clustered
points and would require a new MR job for each iteration. I imagine
initializing the mappers and reducers with the set of clusters and their
RPs. Then each mapper processes a subset of all clustered points,
finally outputting the farthest point it has seen for each cluster. The
reducer gets these candidates and selects the overall most distant one
as the new RP, outputting it with the clusters+RPs for the next
iteration. This is a lot like the way Dirichlet works now, outputting
state to be used for the next iteration over the entire point set. We
would need to allow a DistanceMeasure to be specified for this phase.
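Here is a rough sketch of that mapper/reducer split, with the Hadoop plumbing
omitted and the DistanceMeasure stand-in from the sketch above reused. A real
job would load the clusters+RPs in the mappers' setup() from the previous
iteration's output:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A candidate RP paired with its distance to the nearest existing RP. */
class Candidate {
  final double[] point;
  final double distanceToNearestRp;

  Candidate(double[] point, double distanceToNearestRp) {
    this.point = point;
    this.distanceToNearestRp = distanceToNearestRp;
  }
}

/**
 * Mapper-side accumulator: over its split of the clustered points, keeps
 * only the point per cluster that is farthest from that cluster's RPs.
 */
class FarthestPointAccumulator {
  private final Map<Integer, List<double[]>> rpsByCluster; // loaded at setup()
  private final DistanceMeasure measure;
  private final Map<Integer, Candidate> best = new HashMap<>();

  FarthestPointAccumulator(Map<Integer, List<double[]>> rpsByCluster,
                           DistanceMeasure measure) {
    this.rpsByCluster = rpsByCluster;
    this.measure = measure;
  }

  /** map(): consider one clustered point from this mapper's split. */
  void map(int clusterId, double[] point) {
    double dMin = Double.MAX_VALUE;
    for (double[] rp : rpsByCluster.get(clusterId)) {
      dMin = Math.min(dMin, measure.distance(point, rp));
    }
    Candidate current = best.get(clusterId);
    if (current == null || dMin > current.distanceToNearestRp) {
      best.put(clusterId, new Candidate(point, dMin));
    }
  }

  /** close(): one candidate per cluster is emitted to the reducer. */
  Map<Integer, Candidate> emit() {
    return best;
  }

  /** Reducer side: of all mappers' candidates, keep the most distant. */
  static Candidate reduce(Iterable<Candidate> candidates) {
    Candidate winner = null;
    for (Candidate c : candidates) {
      if (winner == null || c.distanceToNearestRp > winner.distanceToNearestRp) {
        winner = c;
      }
    }
    return winner; // appended to the cluster's RPs for the next iteration
  }
}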
Currently, only canopy and kMeans actually produce their clustered
points. Dirichlet points could be clustered by assigning each point to
the model with the largest pdf (or even to more than one based upon a
user-settable pdf threshold). Fuzzy kMeans would need to make similar
assignments. MeanShift point ids are currently retained in its cluster
state, but there is no step that builds clustered points the way canopy
and kMeans do. Some work would be needed here too, as we need a uniform
representation for clustered points.
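A sketch of that Dirichlet assignment step might look like the following,
where Model is a hypothetical stand-in for a Dirichlet cluster model
exposing its density function:

import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Dirichlet cluster model's density. */
interface Model {
  double pdf(double[] point);
}

class DirichletAssignment {

  /** Hard assignment: the index of the model with the largest pdf. */
  static int assign(double[] point, List<Model> models) {
    int best = -1;
    double bestPdf = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < models.size(); i++) {
      double p = models.get(i).pdf(point);
      if (p > bestPdf) {
        bestPdf = p;
        best = i;
      }
    }
    return best;
  }

  /** Soft assignment: every model whose pdf clears a user-settable threshold. */
  static List<Integer> assignAbove(double[] point, List<Model> models,
                                   double threshold) {
    List<Integer> assigned = new ArrayList<>();
    for (int i = 0; i < models.size(); i++) {
      if (models.get(i).pdf(point) >= threshold) {
        assigned.add(i);
      }
    }
    return assigned;
  }
}

Fuzzy kMeans could make the same kind of assignment using its membership
weights in place of the pdf.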
Finally, I'd like to review the output file naming conventions across
all the clustering algorithms and converge on a single nomenclature
common to all jobs.
Robin Anil wrote:
The cluster center itself is a representative point. One pass over the data
will get us points that are close enough. Or, exhaustively, we could just add
it in the KMeans Mapper and update a counter, maybe?
Robin