Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19340
  
    > That link also mentions that Matlab allows cosine distance. 
http://www.mathworks.com/help/stats/kmeans.html?s_tid=gn_loc_drop
    
    The linked Matlab doc explicitly describes how it computes cluster
centroids differently for each supported distance measure. For cosine
distance, the centroids are computed from normalized points, rather than as
the plain mean of the points used for Euclidean distance. In this respect,
Matlab's approach seems to me more comprehensive than RapidMiner's, which only
takes the mean of the points.
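    
    To make the difference concrete, here is a minimal sketch in plain Python.
It is not Spark, Matlab, or RapidMiner code; the function names are
hypothetical, and it assumes that "computed with normalized points" means
scaling each point to unit Euclidean length before averaging.

```python
import math


def mean_centroid(points):
    """Plain arithmetic mean of the points (the Euclidean-style update)."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def normalized_mean_centroid(points):
    """Mean of the unit-length points (assumed cosine-style update)."""
    return mean_centroid([normalize(p) for p in points])


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two points in very different directions and with very different lengths.
points = [[10.0, 0.0], [0.0, 1.0]]

plain = mean_centroid(points)            # dominated by the long vector
unit = normalized_mean_centroid(points)  # each direction weighted equally

print(plain, unit, cosine_distance(plain, unit))
```

    With points of very different norms, the plain mean is pulled toward the
direction of the longest vector, so the two centroids point in noticeably
different directions. When all points already have similar norms, the two
updates give nearly the same direction.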
    
    I took a quick look at Spark's KMeans implementation, and it looks like we
currently also compute the centroids as the mean of the points.
    
    I'm not sure whether this can be an issue in practical use of KMeans and
affect its results or correctness. If we don't want to update centroids
differently for different distance measures, I think we should at least
clarify this in the documentation to warn users.

