KMeans initial k centers

Mark Hall Sun, 07 Nov 2010 17:21:23 -0800

Hi,

I'm very new to Mahout so please forgive any unintentional stupidity on my part :-) I was just browsing the clustering code for KMeans and have a couple of questions.

The initial random selection of k cluster centers generated by RandomSeedGenerator looks like it is using reservoir sampling in order to give each input instance equal chance of being selected as a center. However, the code is not correct (if it is supposed to be reservoir sampling). After the first k instances have been added to the reservoir, the chance of an instance being selected to randomly replace a reservoir entry should be k / n, where n is the number of instances seen so far. The code uses constant probability 1/k.
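
For comparison, a minimal sketch of standard reservoir sampling (Algorithm R) in Java. This is not the RandomSeedGenerator code; the class name ReservoirSketch and the method sample are hypothetical names used only for illustration. Drawing j uniformly from [0, n) and replacing slot j only when j < k gives the k / n replacement probability described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class, not part of Mahout.
class ReservoirSketch {

    // Returns up to k items, each input item having equal chance of selection.
    static <T> List<T> sample(Iterable<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>();
        int n = 0; // number of instances seen so far
        for (T item : stream) {
            n++;
            if (reservoir.size() < k) {
                // The first k instances fill the reservoir directly.
                reservoir.add(item);
            } else {
                // j is uniform in [0, n), so j < k happens with probability k / n.
                int j = rng.nextInt(n);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }
}
```

Calling ReservoirSketch.sample(instances, k, new Random()) makes a single pass over the input and does not need to know the number of instances in advance.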

Another observation. I see that arbitrary distance functions can be used with KMeans, but, as far as I can see, the centroid is always computed by taking the component-wise mean of the instances in a cluster. The mean minimizes the squared error (i.e. squared Euclidean distance) but does not minimize the intra-cluster distance for other distance functions. E.g. for the Manhattan distance you need to take the component-wise median as the cluster center in order to minimize the distance.
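
A small sketch in Java that illustrates the point on a toy cluster with one outlier. The class MedianCenterSketch, its method names, and the data are all made up for illustration and have nothing to do with Mahout's API. It compares the total Manhattan distance from the points to the component-wise mean and to the component-wise median.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Mahout.
class MedianCenterSketch {

    // Component-wise mean of the points.
    static double[] mean(double[][] points) {
        double[] center = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < center.length; d++) {
                center[d] += p[d];
            }
        }
        for (int d = 0; d < center.length; d++) {
            center[d] /= points.length;
        }
        return center;
    }

    // Component-wise median; for an even count, the midpoint of the two middle values.
    static double[] median(double[][] points) {
        double[] center = new double[points[0].length];
        double[] column = new double[points.length];
        for (int d = 0; d < center.length; d++) {
            for (int i = 0; i < points.length; i++) {
                column[i] = points[i][d];
            }
            Arrays.sort(column);
            int mid = column.length / 2;
            center[d] = (column.length % 2 == 1)
                    ? column[mid]
                    : (column[mid - 1] + column[mid]) / 2.0;
        }
        return center;
    }

    // Sum of Manhattan (L1) distances from every point to the center.
    static double totalManhattan(double[][] points, double[] center) {
        double total = 0.0;
        for (double[] p : points) {
            for (int d = 0; d < center.length; d++) {
                total += Math.abs(p[d] - center[d]);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy data: the last point is an outlier in the first component.
        double[][] cluster = {{1, 10}, {2, 20}, {3, 30}, {4, 40}, {100, 50}};
        System.out.println("mean cost   = " + totalManhattan(cluster, mean(cluster)));
        System.out.println("median cost = " + totalManhattan(cluster, median(cluster)));
    }
}
```

On this toy data the mean is (22, 30) and the median is (3, 30), giving total Manhattan distances of 216.0 and 161.0 respectively, so the median is the better center under that distance.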


Cheers,
Mark.
