Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/19340
  
    The updating of the centers should be viewed as the **M-step** of an EM algorithm, 
in which some objective is optimized. 
    
    Since cosine similarity does not take vector norms into account: 
    1. an optimal solution for the normalized points (`V`) is also optimal for the 
original points;
    2. any positively scaled solution (`k*V, k>0`) is also optimal for both the 
normalized points and the original points (see the small check below).
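    
    Here is a small stand-alone check (plain Scala, hypothetical names, not Spark 
code) illustrating the scale invariance: a positively scaled center gets exactly 
the same cosine similarity as the original center.
    ```
    object CosineScaleInvariance {
      def cosine(a: Array[Double], b: Array[Double]): Double = {
        val dot = a.zip(b).map { case (x, y) => x * y }.sum
        val normA = math.sqrt(a.map(x => x * x).sum)
        val normB = math.sqrt(b.map(x => x * x).sum)
        dot / (normA * normB)
      }

      def main(args: Array[String]): Unit = {
        val point  = Array(3.0, 4.0)
        val center = Array(1.0, 2.0)
        val scaled = center.map(_ * 7.5)   // k * V with k > 0
        println(cosine(point, center))     // ~0.9839
        println(cosine(point, scaled))     // same value
      }
    }
    ```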
    
    If we want to optimize intra-cluster cosine similarity (as Matlab does), then 
the arithmetic mean of the normalized points is a better solution than the 
arithmetic mean of the original points.
    
    Suppose two 2D points (x=0,y=1) and (x=100,y=0):
    
    1. If we choose the arithmetic mean (x=50,y=0.5) as the center, the sum of 
cosine similarities is about 1.0;
    2. If we choose the arithmetic mean of the normalized points (x=0.5,y=0.5), the 
sum of cosine similarities is about 1.414 (both cases are verified in the sketch 
after this list);
    3. this center can then be normalized for computational convenience in the 
following assignment (E-step) or prediction.
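    
    For reference, a quick stand-alone check of those numbers (plain Scala, 
hypothetical names, not Spark code):
    ```
    object CenterComparison {
      def cosine(a: Array[Double], b: Array[Double]): Double = {
        val dot = a.zip(b).map { case (x, y) => x * y }.sum
        dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
      }

      def main(args: Array[String]): Unit = {
        val p1 = Array(0.0, 1.0)
        val p2 = Array(100.0, 0.0)
        val meanOfOriginal   = Array(50.0, 0.5)  // arithmetic mean of p1, p2
        val meanOfNormalized = Array(0.5, 0.5)   // arithmetic mean of p1/|p1|, p2/|p2|
        // ~1.01: the large-norm point dominates the center
        println(cosine(p1, meanOfOriginal) + cosine(p2, meanOfOriginal))
        // ~1.414: both points contribute equally
        println(cosine(p1, meanOfNormalized) + cosine(p2, meanOfNormalized))
      }
    }
    ```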
    
    Since `VectorWithNorm` is used as the input, the norms of the vectors are 
already computed, so I think we only need to update [this 
line](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L314) 
to 
    ```
    if (point.norm > 0) {
      // accumulate the normalized point instead of the raw point
      axpy(1.0 / point.norm, point.vector, sum)
    }
    ```
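    
    For context, a minimal sketch (plain Scala with assumed names, not the actual 
Spark code) of what the resulting M-step would compute: the sum of normalized 
points, their arithmetic mean, and a re-normalized center for the following 
assignment step.
    ```
    object CosineMStepSketch {
      def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

      def updateCenter(points: Seq[Array[Double]], dim: Int): Array[Double] = {
        val sum = new Array[Double](dim)
        var count = 0L
        points.foreach { point =>
          val n = norm(point)
          if (n > 0) {                    // mirrors the `point.norm > 0` guard above
            var i = 0
            while (i < dim) { sum(i) += point(i) / n; i += 1 }  // axpy(1.0 / n, point, sum)
            count += 1
          }
        }
        val mean = sum.map(_ / count)     // arithmetic mean of the normalized points
        val meanNorm = norm(mean)
        mean.map(_ / meanNorm)            // normalized center for the next E-step
      }
    }
    ```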
    


