Ted, I've been meaning to ask you about this.
Currently, we have a parameter called clusterLogFactor [1] that we
multiply by the log of the number of points seen so far.

This is (I guess) meant to behave like the k*log(n) recommended value
for the number of clusters in the paper. So, clusterLogFactor should
actually be k (the number of clusters).

What I'm saying here is...

We get a numClusters parameter anyway. Currently I set this to
k*log(N) (where N is the total number of points at the beginning).
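To make the k*log(N) setting concrete, here is a rough sketch of how I'd compute it. The class and method names are made up for illustration and are not in the current code; I'm also assuming the natural log, rounding up, and a floor of k so that a tiny N doesn't give zero clusters.

```java
// Hypothetical helper, not part of StreamingKMeans as it stands.
final class ClusterEstimate {
    private ClusterEstimate() {}

    /**
     * Returns roughly k * ln(n), rounded up.
     * Assumption: never return fewer than k clusters (ln(1) is 0).
     */
    static int estimateNumClusters(int k, long n) {
        if (k < 1 || n < 1) {
            throw new IllegalArgumentException("k and n must both be at least 1");
        }
        return (int) Math.max(k, Math.ceil(k * Math.log(n)));
    }
}
```

For example, estimateNumClusters(10, 1000000L) gives 139, since ln(10^6) is about 13.8.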

I propose that instead of having two confusing parameters,
estimatedNumClusters and clusterLogFactor, we have just one,
numClusters, with the same semantics as in BallKMeans.
It's about time these were properly documented.

Additionally, I'd remove the max at line 232.

How about it?

[1] 
https://github.com/dfilimon/knn/blob/d6891060b5488e492fd4bcc50343211b8d7da1dd/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java#L47
