Good points on both counts.

The first is easy to fix, but the second is harder.  I suppose that we
could start a numerical minimizer at the mean and let it search for the
point that minimizes the total distance under the chosen distance measure.
I'm not sure how much it would help in practice, but it would definitely
be more correct.

Feel free to file a JIRA and suggest a patch for either one.

On Sun, Nov 7, 2010 at 5:20 PM, Mark Hall <[email protected]> wrote:

> The initial random selection of k cluster centers generated by
> RandomSeedGenerator looks like it is using reservoir sampling in order to
> give each input instance equal chance of being selected as a center.
> However, the code is not correct (if it is supposed to be reservoir
> sampling). After the first k instances have been added to the reservoir, the
> chance of an instance being selected to randomly replace a reservoir entry
> should be k / n, where n is the number of instances seen so far. The code
> uses constant probability 1/k.
>
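
For reference, here is a minimal sketch of textbook reservoir sampling with
the k / n replacement probability described above. It is not the
RandomSeedGenerator code; the class and method names (ReservoirSketch,
sample) are made up for illustration, and a real patch would have to fit
the existing Mahout types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout.
class ReservoirSketch {

  // Picks up to k items from a stream of unknown length so that every item
  // seen so far has the same chance of being in the reservoir.
  static <T> List<T> sample(Iterable<T> items, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    int seen = 0;
    for (T item : items) {
      seen++;
      if (reservoir.size() < k) {
        // The first k items always go in.
        reservoir.add(item);
      } else {
        // nextInt(seen) is uniform on [0, seen), so it is below k with
        // probability k / seen. When it is, it also names the slot to replace.
        int slot = rng.nextInt(seen);
        if (slot < k) {
          reservoir.set(slot, item);
        }
      }
    }
    return reservoir;
  }
}
```

Drawing a single index in [0, n) covers both the accept/reject decision and
the choice of slot, so no second random draw is needed.
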
> Another observation. I see that arbitrary distance functions can be used
> with KMeans, but, as far as I can see, the centroid is always computed by
> taking the component-wise mean of the instances in a cluster. The mean
> minimizes the squared error (i.e. Euclidean distance) but does not minimize
> the intra-cluster distance for other distance functions. E.g. for the
> Manhattan distance you need to take the median as the cluster center in
> order to minimize the distance.
>
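
To make the mean-versus-median point concrete, here is a small sketch that
computes both centers for a cluster and compares the total Manhattan
distance to each. The class and method names are hypothetical and plain
double[] arrays stand in for Mahout vectors; this only illustrates the
claim and is not a proposed patch.

```java
import java.util.Arrays;

// Hypothetical illustration, not part of Mahout.
class MedianCenterSketch {

  // Component-wise median: the median of each coordinate taken separately.
  static double[] componentwiseMedian(double[][] points) {
    int dims = points[0].length;
    double[] center = new double[dims];
    double[] column = new double[points.length];
    for (int d = 0; d < dims; d++) {
      for (int i = 0; i < points.length; i++) {
        column[i] = points[i][d];
      }
      Arrays.sort(column);
      int mid = column.length / 2;
      // For an even count, use the midpoint of the two middle values.
      center[d] = (column.length % 2 == 1)
          ? column[mid]
          : (column[mid - 1] + column[mid]) / 2.0;
    }
    return center;
  }

  // Component-wise mean, which is the centroid the quoted text describes.
  static double[] componentwiseMean(double[][] points) {
    int dims = points[0].length;
    double[] center = new double[dims];
    for (double[] p : points) {
      for (int d = 0; d < dims; d++) {
        center[d] += p[d] / points.length;
      }
    }
    return center;
  }

  // Sum of Manhattan (L1) distances from every point to the given center.
  static double totalManhattan(double[][] points, double[] center) {
    double total = 0.0;
    for (double[] p : points) {
      for (int d = 0; d < center.length; d++) {
        total += Math.abs(p[d] - center[d]);
      }
    }
    return total;
  }
}
```

For the points (0,0), (1,0) and (10,0), the median center is (1,0) with a
total Manhattan distance of 10, while the mean center is (11/3, 0) with a
total of 38/3, so the single outlier pulls the mean away from the L1 optimum.
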
