Hi,
I'm very new to Mahout so please forgive any unintentional stupidity on
my part :-) I was just browsing the clustering code for KMeans and have
a couple of questions.
The initial random selection of k cluster centers generated by
RandomSeedGenerator looks like it is meant to be reservoir sampling, so
that each input instance has an equal chance of being selected as a
center. If that is the intent, the code is not correct. After the first
k instances have been added to the reservoir, the chance of a new
instance being selected to randomly replace a reservoir entry should be
k / n, where n is the number of instances seen so far (including the
new one). The code uses a constant probability of 1/k instead.
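
For reference, here is a minimal sketch of what I would expect reservoir
sampling to look like. This is not Mahout's code: the class and method
names (ReservoirSketch, sample) are made up for illustration, and it
samples generic items rather than Mahout vectors.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not RandomSeedGenerator.
final class ReservoirSketch {

    // Returns up to k items, each input item kept with equal probability.
    static <T> List<T> sample(Iterable<T> input, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        int n = 0; // number of instances seen so far
        for (T item : input) {
            n++;
            if (reservoir.size() < k) {
                // The first k instances always go into the reservoir.
                reservoir.add(item);
            } else {
                // j is uniform in [0, n), so j < k happens with
                // probability k / n; j then also picks the slot to replace.
                int j = rng.nextInt(n);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }
}
```

Drawing one index in [0, n) covers both decisions at once: whether to
replace (probability k / n) and which entry to replace (uniform over
the k slots).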
Another observation: I see that arbitrary distance functions can be used
with KMeans but, as far as I can see, the centroid is always computed
by taking the component-wise mean of the instances in a cluster. The
mean minimizes the sum of squared Euclidean distances to the cluster
members, but it does not minimize the intra-cluster distance for other
distance functions. E.g. for the Manhattan distance you need to take
the component-wise median as the cluster center in order to minimize
the total distance.
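
To illustrate the point, here is a small sketch that compares the two
choices of center under the Manhattan distance. Again this is not
Mahout code: CenterSketch and its methods are names I made up, and
points are plain double arrays rather than Mahout vectors.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Mahout.
final class CenterSketch {

    // Component-wise mean of the points.
    static double[] mean(double[][] points) {
        int dims = points[0].length;
        double[] center = new double[dims];
        for (double[] p : points) {
            for (int d = 0; d < dims; d++) {
                center[d] += p[d] / points.length;
            }
        }
        return center;
    }

    // Component-wise median of the points (for an even count, the
    // midpoint of the two middle values).
    static double[] median(double[][] points) {
        int dims = points[0].length;
        int n = points.length;
        double[] center = new double[dims];
        double[] column = new double[n];
        for (int d = 0; d < dims; d++) {
            for (int i = 0; i < n; i++) {
                column[i] = points[i][d];
            }
            Arrays.sort(column);
            center[d] = (n % 2 == 1)
                    ? column[n / 2]
                    : (column[n / 2 - 1] + column[n / 2]) / 2.0;
        }
        return center;
    }

    // Sum of Manhattan (L1) distances from every point to the center.
    static double totalManhattan(double[][] points, double[] center) {
        double total = 0.0;
        for (double[] p : points) {
            for (int d = 0; d < center.length; d++) {
                total += Math.abs(p[d] - center[d]);
            }
        }
        return total;
    }
}
```

For the three points (0, 0), (1, 2) and (10, 4), the median center is
(1, 2) with a total Manhattan distance of 14, whereas the mean center
(11/3, 2) gives about 16.67.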
Cheers,
Mark.
- KMeans initial k centers Mark Hall