[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831622#action_12831622 ]
Ted Dunning commented on MAHOUT-153:
------------------------------------

I have been thinking about this problem a bit, particularly with respect to the obvious problems of parallelizing this very sequential algorithm in the context of the large sparse data that is common in scalable data mining. What I have come up with is an approximate map-reduce algorithm that might provide good speedup and scalability.

The idea is that each mapper essentially performs the k-means++ starting point selection on the data split it sees. The output from each mapper is a set of potential starting centroids, and the combined output is a (too large) set of centroid candidates. The k-means++ starting point selection algorithm can then be applied to this set of candidates to get the final set of initial centroids.

Moreover, for sparse data it is fairly common to find points that have very low similarity to previous points. This suggests that the mappers can apply a one-pass version of the starting point selection algorithm to their data. This one-pass algorithm would keep the set of starting points selected so far, as well as an online estimate of the distribution of distances from points to the set of already selected centroids. The distribution estimate should take the form of estimating the 1 - k/n quantile of the distance distribution, where k is the desired number of starting points and n is the best estimate of the number of points that will be seen by the mapper. n can be the number of points seen so far, but after just a few points have been seen, this estimate can be refined based on the average size of points and the total size of the split being examined by the mapper.
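A minimal sketch of this one-pass, mapper-side selection (function name and signature are hypothetical, not Mahout code; a real mapper would maintain an online quantile estimator rather than retaining every observed distance):

```python
def one_pass_candidates(points, k, n_estimate, distance):
    """Streaming candidate selection for one mapper (illustrative sketch).

    A point is kept as a candidate centroid when its distance to the
    nearest candidate so far exceeds a running estimate of the
    1 - k/n quantile of the nearest-candidate distances seen so far.
    """
    stream = iter(points)
    centroids = [next(stream)]        # seed with the first data point
    observed = []                     # nearest-candidate distances seen so far
    target = 1.0 - float(k) / n_estimate
    for p in stream:
        d = min(distance(c, p) for c in centroids)
        observed.append(d)
        # Crude quantile estimate by sorting; a production mapper would
        # use an online quantile sketch instead of keeping all distances.
        observed.sort()
        threshold = observed[int(target * (len(observed) - 1))]
        if d > threshold:
            centroids.append(p)
    return centroids
```

As noted, the threshold is unreliable early in the stream, so the mapper emits extra candidates; these get pruned when the second phase re-runs selection over the merged candidate set.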
The one-pass algorithm is thus:

{noformat}
centroids = set of (first data point)
for each input point p
    update quantile_estimate using distance of p to all centroids
    if min(distance(centroids_i, p)) > quantile_estimate
        centroids.add(p)
{noformat}

This algorithm will be too lax at the beginning, but poor starting points will be pruned in the second phase, when all starting points are re-examined to select the final set.

> Implement kmeans++ for initial cluster selection in kmeans
> ----------------------------------------------------------
>
>                 Key: MAHOUT-153
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-153
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.2
>         Environment: OS Independent
>            Reporter: Panagiotis Papadimitriou
>            Assignee: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: Mahout-153.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for initial cluster selection (seed selection): 1) random selection of k points, 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will become a subclass of SeedGenerator.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
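For reference, the second phase of the approach discussed above (standard k-means++ D^2-weighted selection, per the paper linked in the issue, applied to the merged candidate set) can be sketched as follows; the function name and signature are illustrative, not Mahout's SeedGenerator API:

```python
import random

def kmeans_pp_select(candidates, k, distance, rng=None):
    """k-means++ seeding over a merged candidate set (illustrative sketch).

    Each new seed is drawn with probability proportional to the squared
    distance from the nearest seed already chosen (D^2 weighting).
    """
    rng = rng or random.Random()
    seeds = [rng.choice(candidates)]          # first seed: uniform at random
    while len(seeds) < k:
        # D^2 weight of each candidate: squared distance to its nearest seed.
        weights = [min(distance(s, p) for s in seeds) ** 2 for p in candidates]
        total = sum(weights)
        if total == 0:                        # every candidate coincides with a seed
            break
        # Sample one candidate in proportion to its weight.
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(candidates, weights):
            acc += w
            if acc > r:
                seeds.append(p)
                break
    return seeds
```

Because duplicates of an existing seed carry zero weight, the sampling naturally favors candidates far from everything chosen so far, which is what prunes the poor starting points emitted by the lax one-pass mappers.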