On Tue, May 21, 2013 at 1:47 AM, Dan Filimon <[email protected]> wrote:
> So if you want to be *totally* anal about this, you have to deal with the fact
> that the threshold on some mapper inputs stays low and on others goes high.
> In such a case, if the large threshold stuff comes first, bad things could
> happen.
>
> One fix would be to emit the thresholds with special keys that put them ahead
> of all of the centroids. You could then pick the smallest of the thresholds
> that you see.
>
> That is a pain in the ass for low probability gain, it seems to me.
>
> Yes, that's a fair point.
> The number of clusters is simply the same as the number of clusters requested
> from the mappers (each mapper is supposed to generate the same k log (n / m)
> number of clusters).
>
> But k log (n/m) is already a lower bound for all of the thresholds we'd get
> from the mappers. So by picking this, we'll never over-estimate how many
> clusters should in fact be generated. And that's fine, since we're adjusting
> it as we're running SKM.
>
> I think I wasn't clear. k log (n/m) is a bound on the number of points. It
> has nothing to do with the cluster-attach-if-close threshold.
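
For concreteness, the special-key idea quoted above could look roughly like the sketch below. This is plain Python that simulates the shuffle, not Mahout code; the key values, the function names, and the sample numbers are all made up for illustration. The only point is that thresholds sort ahead of centroids, so the reducer can take the smallest one before it touches any centroid.

```python
import math

# Hypothetical sort keys: thresholds get key 0, centroids get key 1, so after
# the shuffle's sort every threshold arrives before any centroid.
THRESHOLD_KEY = 0
CENTROID_KEY = 1


def mapper_output(threshold, centroids):
    """Emit one mapper's final distance threshold, then its centroids."""
    yield (THRESHOLD_KEY, threshold)
    for c in centroids:
        yield (CENTROID_KEY, c)


def reducer_input(mapper_outputs):
    """Simulate the shuffle: merge all mapper records and sort by key only."""
    records = [rec for out in mapper_outputs for rec in out]
    return sorted(records, key=lambda rec: rec[0])  # stable sort


def split_reducer_stream(records):
    """Return (smallest threshold seen, centroids in arrival order)."""
    smallest = math.inf
    centroids = []
    for key, value in records:
        if key == THRESHOLD_KEY:
            smallest = min(smallest, value)
        else:
            centroids.append(value)
    return smallest, centroids


# One mapper ended with a high threshold, the other with a low one.
outputs = [
    mapper_output(5.0, [(10.0, 10.0)]),
    mapper_output(0.5, [(0.0, 0.0), (1.0, 1.0)]),
]
start_threshold, centroids = split_reducer_stream(reducer_input(outputs))
```

Here the reducer would start from the low threshold (0.5) even though the high-threshold mapper's output was merged first. Whether that is worth the extra plumbing is the open question above.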
