Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19340
The updating of centers should be viewed as the **M-step** of an EM-style algorithm,
in which some objective is optimized.
Since cosine similarity does not take the vector norm into account:
1. the optimal solution for the normalized points (`V`) should also be optimal for the
original points;
2. any scaled solution (`k*V, k>0`) is also optimal for both the normalized points and
the original points (a one-line check of this is below).
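A one-line check of the scale-invariance behind both points (just the math, not taken from the Spark code):
```latex
% Cosine similarity ignores positive scaling of either argument, so the
% objective value at a center V equals the value at k*V for any k > 0.
\cos(kx, y)
  = \frac{(kx) \cdot y}{\lVert kx \rVert \, \lVert y \rVert}
  = \frac{k \, (x \cdot y)}{k \, \lVert x \rVert \, \lVert y \rVert}
  = \cos(x, y), \qquad k > 0.
```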
If we want to optimize intra-cluster cosine similarity (as Matlab does), then the
arithmetic mean of the normalized points should be a better solution than the arithmetic
mean of the original points.
Suppose two 2D points, (x=0, y=1) and (x=100, y=0):
1. If we choose the arithmetic mean (x=50, y=0.5) as the center, the sum of
cosine similarities is about 1.0;
2. If we choose the arithmetic mean of the normalized points (x=0.5, y=0.5), the
sum of cosine similarities is about 1.414 (both sums are checked in the sketch below);
3. this center can then be normalized for computational convenience in the
following assignment (E-step) or prediction.
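For illustration, a small standalone check of the two sums above (plain Scala; `cosine` is a hypothetical helper, not part of the Spark code):
```scala
object CosineCenterCheck {
  // Plain cosine similarity between two dense vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    def norm(v: Array[Double]) = math.sqrt(v.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(Array(0.0, 1.0), Array(100.0, 0.0))
    val rawMean  = Array(50.0, 0.5)   // arithmetic mean of the original points
    val normMean = Array(0.5, 0.5)    // arithmetic mean of the normalized points

    println(points.map(cosine(_, rawMean)).sum)   // ~1.01
    println(points.map(cosine(_, normMean)).sum)  // ~1.414
  }
}
```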
Since `VectorWithNorm` is used as the input, the norms of the vectors are already
computed, so I think we only need to update [this
line](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L314)
to
```scala
if (point.norm > 0) {
  axpy(1.0 / point.norm, point.vector, sum)
}
```