Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19340
@mgaido91 @srowen I have the same concern as @Kevin-Ferret and @viirya.
I don't see any normalization of the vectors before training, and the
center update seems incorrect.
The arithmetic mean of all points in the cluster is not automatically the
new cluster center:
For EUCLIDEAN distance, we update the center to minimize the squared
loss, and the arithmetic mean is the closed-form solution;
For COSINE similarity, we update the center to *maximize the cosine
similarity*, and the solution coincides with the arithmetic mean only if
all vectors are of unit length.
MATLAB's doc for kmeans says: "One minus the cosine of the included
angle between points (treated as vectors). Each centroid is the mean of the
points in that cluster, after *normalizing those points to unit Euclidean
length*."
I think RapidMiner's implementation of KMeans with cosine similarity is
wrong if it just assigns the new center to the arithmetic mean.
Some references:
[Spherical k-Means
Clustering](https://www.jstatsoft.org/article/view/v050i10/v50i10.pdf)
[Scikit-Learn's example: Clustering text documents using
k-means](http://scikit-learn.org/dev/auto_examples/text/plot_document_clustering.html)
https://stats.stackexchange.com/questions/299013/cosine-distance-as-similarity-measure-in-kmeans
https://www.quora.com/How-can-I-use-cosine-similarity-in-clustering-For-example-K-means-clustering