Re: Review Request: MAHOUT-1224: Add the option of running a StreamingKMeans pass in the Reducer before BallKMeans

Ted Dunning Tue, 21 May 2013 14:41:08 -0700

On Tue, May 21, 2013 at 1:47 AM, Dan Filimon <[email protected]> wrote:


> So if you want to be *totally* anal about this, you have to deal with the fact
> that the threshold on some mapper inputs stays low and on others goes high.  
> In such a case, if the large threshold stuff comes first, bad things could 
> happen.
>
> One fix would be to emit the thresholds with special keys that put them ahead 
> of all of the centroids.  You could then pick the smallest of the thresholds 
> that you see.
>
> That is a pain in the ass for low probability gain, it seems to me.
>
>  Yes, that's a fair point.
> The number of clusters is simply the same as the number of clusters requested
> from the mappers (each mapper is supposed to generate the same k log (n / m) 
> number of clusters).
>
> But k log (n/m) is already a lower bound for all of the thresholds we'd get 
> from the mappers. So by picking this, we'll never over-estimate how many 
> clusters should in fact be generated. And that's fine, since we're adjusting 
> it as we're running SKM.
>
>
I think I wasn't clear.  k log (n/m) is a bound on the number of points.
 It has nothing to do with the cluster-attach-if-close threshold.
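
To make the distinction concrete, here is a rough Python sketch (not Mahout
code). It assumes each mapper emits its final distance cutoff under a
hypothetical sentinel key that sorts ahead of the centroid keys, so the
reducer can pick the smallest cutoff before it touches any centroid. The key
values, function names, and the use of the natural log are all assumptions
made for illustration only.

```python
import math

# Hypothetical keys: the threshold key sorts ahead of every centroid key,
# so the reducer sees all thresholds before any centroid.
THRESHOLD_KEY = 0
CENTROID_KEY = 1


def num_sketch_clusters(k, n, m):
    """Size bound k * log(n / m): a count of sketch points, not a distance.

    Natural log is an assumption; the thread does not name a base.
    """
    return int(math.ceil(k * math.log(n / m)))


def reduce_records(records):
    """Order mapper output by key, then take the smallest threshold seen.

    records: iterable of (key, payload) pairs. The payload is a float
    distance cutoff for THRESHOLD_KEY and a centroid tuple for CENTROID_KEY.
    """
    ordered = sorted(records, key=lambda r: r[0])  # stable: thresholds first
    thresholds = [p for key, p in ordered if key == THRESHOLD_KEY]
    centroids = [p for key, p in ordered if key == CENTROID_KEY]
    return min(thresholds), centroids


mapper_output = [
    (CENTROID_KEY, (0.0, 1.0)),
    (THRESHOLD_KEY, 2.5),  # a mapper whose cutoff grew large
    (CENTROID_KEY, (4.0, 4.0)),
    (THRESHOLD_KEY, 0.3),  # a mapper whose cutoff stayed low
]
start_cutoff, centroids = reduce_records(mapper_output)
```

Here start_cutoff is a distance (the attach-if-close threshold the reducer
would start from), while num_sketch_clusters(k, n, m) is only a count. The
two values are computed independently and neither is derived from the other.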
