The thing is that somehow I ended up not using clusterLogFactor in
that capacity at all.
Currently it's set to 10 by default (way more that I assume the theta
would imply).

I agree that we don't always know N, but I'd argue that we can
estimate a decent log N, especially since using StreamingKMeans in a
MapReduce implies you're doing processing the points in a batch and
know how many there are.

Currently we're doing clusterLogFactor *
Math.log(numProcessedDataPoints) whereas we should in fact be doing
clusterFudgeFactor * numClusters * Math.log(numProcessedDataPoints) so
we can estimate log N from c * log n.
While it might take a while to get n to approach N, we're using n^c to
estimate N, and after n grows larger than N^(1/c), we're
overestimating the number of clusters.

I guess what I'm asking is why do we even bother to have a dynamic
number of clusters? We're using the sketch to do ball k-means anyway,
aren't we?


On Thu, Jan 3, 2013 at 2:14 AM, Ted Dunning <[email protected]> wrote:
> Well, the point of clusterLogFactor was original to be c in the expression
> c * k * log N.  The idea was that this would normally be in the range from
> 1 to 3 and would be a fudge factor to increase k log N.  This is useful
> where we know k, but not N.
>
> On the other hand, in some cases we can just pick the maximum number of
> sketch clusters and we just want to set it so.  This is very useful in
> testing.  This number is almost always larger than the value of c * k * log
> N so the max is pretty decent as a way of choosing between the two.
>
> I agree that this should be normalized and documented, but I think we
> probably will want two behaviors until we really understand how to pick an
> adaptive limit.  I am happy if you have some way to expressive this well.
>
> On Wed, Jan 2, 2013 at 3:38 PM, Dan Filimon 
> <[email protected]>wrote:
>
>> Ted, I've been meaning to ask you about this.
>> Currently, we have a parameter called clusterLogFactor [1] that we
>> multiply by the number of points seen so far.
>>
>> This is (I guess) meant to behave like the k*log(n) recommended value
>> for the number of clusters in the paper. So, clusterLogFactor should
>> actually be k (the number of clusters).
>>
>> What I'm saying here is...
>>
>> We get a numClusters parameter anyway. Currently I set this to
>> k*log(N) (where N is the total number of points at the beginning).
>>
>> I propose that instead of having two confusing parameters:
>> estimatedNumClusters and clusterLogFactor, to just have one,
>> numClusters that has the same semantics as in BallKMeans.
>> It's about time these were properly documented.
>>
>> Additionally, I'd remove the max at line 232.
>>
>> How about it?
>>
>> [1]
>> https://github.com/dfilimon/knn/blob/d6891060b5488e492fd4bcc50343211b8d7da1dd/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java#L47
>>

Reply via email to