Hi,

I'm very new to Mahout so please forgive any unintentional stupidity on my part :-) I was just browsing the clustering code for KMeans and have a couple of questions.

The initial random selection of k cluster centers generated by RandomSeedGenerator looks like it is using reservoir sampling to give each input instance an equal chance of being selected as a center. However, if it is meant to be reservoir sampling, the code is not correct. After the first k instances have been added to the reservoir, the probability that a new instance replaces a randomly chosen reservoir entry should be k / n, where n is the number of instances seen so far (including the new one). The code uses a constant probability of 1/k instead, so the instances do not all get an equal chance.
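
For reference, here is a minimal sketch of the replacement rule I have in mind. This is not Mahout's code: the class and method names (ReservoirSampler, sample) are made up for illustration, and it assumes the input is visited once through a plain Iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Mahout.
final class ReservoirSampler {
    private ReservoirSampler() {}

    /** Picks up to k items uniformly at random in a single pass over the input. */
    static <T> List<T> sample(Iterator<T> items, int k, Random rng) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<T> reservoir = new ArrayList<>(k);
        int n = 0; // number of items seen so far, including the current one
        while (items.hasNext()) {
            T item = items.next();
            n++;
            if (reservoir.size() < k) {
                // The first k items always go straight into the reservoir.
                reservoir.add(item);
            } else if (rng.nextDouble() < (double) k / n) {
                // Replace with probability k / n (not a constant),
                // overwriting a uniformly chosen reservoir slot.
                reservoir.set(rng.nextInt(k), item);
            }
        }
        return reservoir;
    }
}
```

Calling ReservoirSampler.sample(data.iterator(), k, new Random()) on any collection should then return k items, with each input instance having the same k / N chance of ending up in the result, where N is the total number of instances.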

A second observation: I see that arbitrary distance functions can be used with KMeans but, as far as I can see, the centroid is always computed by taking the component-wise mean of the instances in a cluster. The mean minimizes the sum of squared Euclidean distances to the cluster members, but it does not in general minimize the intra-cluster distance for other distance functions. For the Manhattan distance, for example, you need to take the component-wise median as the cluster center in order to minimize the total distance.
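
To make the mean versus median point concrete, here is a small self-contained sketch. Again, this is not Mahout code: CenterComparison and its methods are hypothetical names, and points are plain double arrays rather than Mahout vectors.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Mahout.
final class CenterComparison {
    private CenterComparison() {}

    /** Component-wise mean of a non-empty set of equal-length points. */
    static double[] mean(double[][] points) {
        int dims = points[0].length;
        double[] center = new double[dims];
        for (double[] p : points) {
            for (int d = 0; d < dims; d++) {
                center[d] += p[d] / points.length;
            }
        }
        return center;
    }

    /** Component-wise median of a non-empty set of equal-length points. */
    static double[] median(double[][] points) {
        int dims = points[0].length;
        int n = points.length;
        double[] center = new double[dims];
        for (int d = 0; d < dims; d++) {
            double[] column = new double[n];
            for (int i = 0; i < n; i++) {
                column[i] = points[i][d];
            }
            Arrays.sort(column);
            // For an even count, use the midpoint of the two middle values.
            center[d] = (n % 2 == 1)
                    ? column[n / 2]
                    : (column[n / 2 - 1] + column[n / 2]) / 2.0;
        }
        return center;
    }

    /** Sum of Manhattan (L1) distances from every point to the given center. */
    static double totalManhattan(double[][] points, double[] center) {
        double total = 0.0;
        for (double[] p : points) {
            for (int d = 0; d < center.length; d++) {
                total += Math.abs(p[d] - center[d]);
            }
        }
        return total;
    }
}
```

For the three points (0,0), (1,0) and (10,0), the median center is (1,0) with a total Manhattan distance of 10, while the mean center is (11/3, 0) with a total of about 12.67.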

Cheers,
Mark.
