Good points on both counts. The first is easy to fix, but the second is relatively difficult. I suppose we could start a minimizer at the mean and let it search for the point that minimizes the total distance under the chosen distance measure. I'm not sure how much it would help in practice, but it would definitely be more correct.
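
For the Manhattan case raised in the quoted message, no iterative search is needed, because the component-wise median is a closed-form minimizer. Here is a minimal sketch of that special case, assuming dense points stored as plain double[] arrays. The class and method names are hypothetical and are not taken from the Mahout code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: cluster center for Manhattan (L1) distance.
final class ManhattanCenter {
  private ManhattanCenter() {}

  /** Returns the component-wise median, which minimizes the summed L1 distance to the points. */
  static double[] componentwiseMedian(List<double[]> points) {
    if (points.isEmpty()) {
      throw new IllegalArgumentException("need at least one point");
    }
    int dims = points.get(0).length;
    double[] center = new double[dims];
    double[] column = new double[points.size()];
    for (int d = 0; d < dims; d++) {
      // Collect coordinate d of every point and sort it to find the median.
      for (int i = 0; i < points.size(); i++) {
        column[i] = points.get(i)[d];
      }
      Arrays.sort(column);
      int mid = column.length / 2;
      center[d] = (column.length % 2 == 1)
          ? column[mid]
          : 0.5 * (column[mid - 1] + column[mid]); // even count: average the two middle values
    }
    return center;
  }

  /** Sum of Manhattan distances from every point to the candidate center. */
  static double totalManhattan(List<double[]> points, double[] center) {
    double total = 0.0;
    for (double[] p : points) {
      for (int d = 0; d < center.length; d++) {
        total += Math.abs(p[d] - center[d]);
      }
    }
    return total;
  }
}
```

Comparing totalManhattan at the median and at the mean on a cluster with an outlier shows the difference. Other distance measures generally have no such closed form, which is where a numerical minimizer started at the mean would come in.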
Feel free to file a JIRA and suggest a patch for either one.

On Sun, Nov 7, 2010 at 5:20 PM, Mark Hall <[email protected]> wrote:

> The initial random selection of k cluster centers generated by
> RandomSeedGenerator looks like it is using reservoir sampling in order to
> give each input instance equal chance of being selected as a center.
> However, the code is not correct (if it is supposed to be reservoir
> sampling). After the first k instances have been added to the reservoir, the
> chance of an instance being selected to randomly replace a reservoir entry
> should be k / n, where n is the number of instances seen so far. The code
> uses constant probability 1/k.
>
> Another observation. I see that arbitrary distance functions can be used
> with KMeans, but, as far as I can see, the centroid is always computed by
> taking the component-wise mean of the instances in a cluster. The mean
> minimizes the squared error (i.e. Euclidean distance) but does not minimize
> the intra-cluster distance for other distance functions. E.g. for the
> Manhattan distance you need to take the median as the cluster center in
> order to minimize the distance.
>
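
For reference, the reservoir-sampling rule described in the quoted message (replace with probability k / n once the reservoir is full) can be sketched as below. This is a generic illustration of the standard algorithm, not the RandomSeedGenerator code. The class name, method signature, and use of java.util.Random are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: uniform sample of k items in one pass.
final class ReservoirSampler {
  private ReservoirSampler() {}

  /** Returns k items chosen uniformly at random, or every item if the input has fewer than k. */
  static <T> List<T> sample(Iterable<T> input, int k, Random rng) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    List<T> reservoir = new ArrayList<>(k);
    int seen = 0; // n: number of instances read so far
    for (T item : input) {
      seen++;
      if (reservoir.size() < k) {
        // The first k instances always fill the reservoir.
        reservoir.add(item);
      } else {
        // nextInt(seen) is uniform on [0, seen), so it is below k with probability k / seen.
        int slot = rng.nextInt(seen);
        if (slot < k) {
          reservoir.set(slot, item); // evict a uniformly chosen reservoir entry
        }
      }
    }
    return reservoir;
  }
}
```

Drawing a single index in [0, n) and testing it against k gives both the k / n acceptance test and a uniformly chosen slot to overwrite, so no second random draw is needed.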
