Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19340
> That link also mentions that Matlab allows cosine distance.
http://www.mathworks.com/help/stats/kmeans.html?s_tid=gn_loc_drop
The linked Matlab doc explicitly describes how it computes cluster
centroids differently for each supported distance measure. For cosine
distance, the centroids are computed from the normalized points, rather than
as the plain mean of the points used for Euclidean distance. In this respect,
Matlab's approach seems more thorough to me than RapidMiner's, which only
takes the mean of the points.
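
To make the difference concrete, here is a minimal sketch of the two centroid
updates. It is plain Python, not Spark or Matlab code; the function names and
the two sample points are hypothetical, and the cosine variant assumes that
"normalized" means scaling each point to unit Euclidean length before
averaging.

```python
import math


def mean_centroid(points):
    """Plain mean of the points (the usual Euclidean k-means update)."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dim)]


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for the zero vector")
    return [x / norm for x in v]


def cosine_centroid(points):
    """Mean of the points after normalizing each one (assumed cosine update)."""
    return mean_centroid([normalize(p) for p in points])


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    a_n, b_n = normalize(a), normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a_n, b_n))


# Two points in one cluster with very different magnitudes (made-up data).
points = [[10.0, 0.0], [0.0, 1.0]]

plain = mean_centroid(points)     # dominated by the long vector
cosine = cosine_centroid(points)  # both directions weigh equally

for name, c in (("plain mean", plain), ("normalized mean", cosine)):
    dists = [round(cosine_distance(p, c), 4) for p in points]
    print(name, c, dists)
```

With the plain mean, the centroid points almost entirely along the longer
vector, so the short point ends up far away in cosine distance. With the
normalized mean, both points are equally distant from the centroid.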
I took a quick look at Spark's KMeans implementation, and it looks like we
currently also compute the centroids as the mean of the points.
I'm not sure whether this is an issue in practical use of KMeans, or whether
it affects the correctness of its results. If we don't want to update
centroids differently for different distance measures, I think we should at
least clarify this in the documentation to warn users.