The thing is that somehow I ended up not using clusterLogFactor in that capacity at all. Currently it's set to 10 by default (way more that I assume the theta would imply).
I agree that we don't always know N, but I'd argue that we can estimate a decent log N, especially since using StreamingKMeans in a MapReduce implies you're doing processing the points in a batch and know how many there are. Currently we're doing clusterLogFactor * Math.log(numProcessedDataPoints) whereas we should in fact be doing clusterFudgeFactor * numClusters * Math.log(numProcessedDataPoints) so we can estimate log N from c * log n. While it might take a while to get n to approach N, we're using n^c to estimate N, and after n grows larger than N^(1/c), we're overestimating the number of clusters. I guess what I'm asking is why do we even bother to have a dynamic number of clusters? We're using the sketch to do ball k-means anyway, aren't we? On Thu, Jan 3, 2013 at 2:14 AM, Ted Dunning <[email protected]> wrote: > Well, the point of clusterLogFactor was original to be c in the expression > c * k * log N. The idea was that this would normally be in the range from > 1 to 3 and would be a fudge factor to increase k log N. This is useful > where we know k, but not N. > > On the other hand, in some cases we can just pick the maximum number of > sketch clusters and we just want to set it so. This is very useful in > testing. This number is almost always larger than the value of c * k * log > N so the max is pretty decent as a way of choosing between the two. > > I agree that this should be normalized and documented, but I think we > probably will want two behaviors until we really understand how to pick an > adaptive limit. I am happy if you have some way to expressive this well. > > On Wed, Jan 2, 2013 at 3:38 PM, Dan Filimon > <[email protected]>wrote: > >> Ted, I've been meaning to ask you about this. >> Currently, we have a parameter called clusterLogFactor [1] that we >> multiply by the number of points seen so far. >> >> This is (I guess) meant to behave like the k*log(n) recommended value >> for the number of clusters in the paper. So, clusterLogFactor should >> actually be k (the number of clusters). >> >> What I'm saying here is... >> >> We get a numClusters parameter anyway. Currently I set this to >> k*log(N) (where N is the total number of points at the beginning). >> >> I propose that instead of having two confusing parameters: >> estimatedNumClusters and clusterLogFactor, to just have one, >> numClusters that has the same semantics as in BallKMeans. >> It's about time these were properly documented. >> >> Additionally, I'd remove the max at line 232. >> >> How about it? >> >> [1] >> https://github.com/dfilimon/knn/blob/d6891060b5488e492fd4bcc50343211b8d7da1dd/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java#L47 >>
