[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831622#action_12831622 ]

Ted Dunning commented on MAHOUT-153:
------------------------------------


I have been thinking about this problem a bit, particularly with respect to the 
obvious problems of parallelizing this very sequential algorithm in the context 
of the large sparse data that is common with scalable data mining.

What I have come up with is a map-reduce approximate algorithm that might 
provide good speedup and scalability.   

The idea is that each mapper essentially performs the k-means++ starting 
point selection on the data split it sees.  The output from each mapper is a 
set of potential starting centroids, and the combined output is a (too large) 
set of centroid candidates.  The k-means++ starting point selection algorithm 
can then be applied to this candidate set to get the final set of initial 
centroids.
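As a rough sketch of that second phase, here is the standard k-means++ 
D-squared selection applied to the pooled candidates.  This is a hypothetical 
Python helper, not Mahout code; the distance function and the name 
kmeans_pp_select are assumptions for illustration:

```python
import random

def kmeans_pp_select(points, k, dist, rng=random.Random(42)):
    """k-means++ seeding: pick the first centroid uniformly at random,
    then pick each further centroid with probability proportional to its
    squared distance to the nearest centroid chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # squared distance of each candidate to its nearest chosen centroid
        d2 = [min(dist(p, c) for c in centroids) ** 2 for p in points]
        total = sum(d2)
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:       # weighted sampling by cumulative sum
                centroids.append(p)
                break
    return centroids
```

Run over the combined mapper outputs, this reduces the over-large candidate 
set down to exactly k initial centroids.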

Moreover, for sparse data it is fairly common to find points that have very 
low similarity to all previously seen points.  This suggests that the mappers 
can apply a one-pass version of the starting point selection algorithm to 
their data.

This one-pass algorithm would keep the set of starting points selected so 
far, as well as an on-line estimate of the distribution of distances of 
points to the already selected centroids.  The distribution estimate should 
take the form of estimating the 1 - k / n quantile of the distance 
distribution, where k is the desired number of starting points and n is the 
best estimate of the number of points the mapper will see.  Initially n can 
be the number of points seen so far, but after just a few points this 
estimate can be refined using the average size of a point and the total size 
of the split being examined by the mapper.

The one pass algorithm is thus:

{noformat}
centroids = set.of(first data point)
for each input point p
   update quantile_estimate using distance of p to all centroids
   if min(distance(centroids_i, p)) > quantile_estimate
      centroids.add(p)
{noformat}
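A minimal executable sketch of this one-pass loop (hypothetical Python, not 
Mahout code) might track the running quantile with a Robbins-Monro style 
stochastic approximation; the step size and the particular update rule are 
my assumptions, not part of the proposal above:

```python
def one_pass_seeds(stream, k, n_est, dist, step=0.1):
    """One-pass candidate selection: keep any point whose distance to the
    nearest centroid so far exceeds a running estimate of the (1 - k/n)
    quantile of such distances."""
    stream = iter(stream)
    centroids = [next(stream)]      # seed with the first data point
    q = 1.0 - k / n_est             # target quantile
    est = 0.0                       # running quantile estimate
    for p in stream:
        d = min(dist(p, c) for c in centroids)
        # stochastic-approximation quantile update: move the estimate up
        # by step*q when the sample exceeds it, down by step*(1-q) otherwise
        est += step * (q if d > est else q - 1.0)
        if d > est:
            centroids.append(p)
    return centroids
```

Because est starts at zero, early points are admitted freely, which matches 
the observation below that the algorithm is too lax at the beginning.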

This algorithm will be too lax at the beginning, but poor starting points 
will be pruned in the second phase, when all starting points are re-examined 
to select the final set.

> Implement kmeans++ for initial cluster selection in kmeans
> ----------------------------------------------------------
>
>                 Key: MAHOUT-153
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-153
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.2
>         Environment: OS Independent
>            Reporter: Panagiotis Papadimitriou
>            Assignee: Ted Dunning
>             Fix For: 0.4
>
>         Attachments: Mahout-153.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for 
> initial cluster selection (seed selection): 1) random selection of k points, 
> 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available 
> here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass 
> KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will 
> become a subclass of SeedGenerator.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
