Github user Kevin-Ferret commented on the issue:

https://github.com/apache/spark/pull/19340

@mgaido91 I actually needed something like this recently and stumbled upon your PR (and the JIRA, which unfortunately I cannot update). Your approach looks good to me, but I was wondering: are you keeping the existing code for finding the new centers, in the sense of "newCenter = the arithmetic mean of all points in the cluster"? (I couldn't find any diff touching this piece: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L291-L298)

If so, this will give misleading results (your centroids won't minimize the within-cluster cosine distance anymore) unless you make sure to normalize all your input vectors AND the newly computed centroid, i.e. spherical k-means. I might be wrong or have misread the code; forgive me if that's the case.

If you want to discuss further, what's the best place (assuming you don't want this PR derailed with my blabbering)?
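For concreteness, here is a minimal sketch of the spherical k-means update I'm referring to (plain Scala, not the PR's code; `sphericalCentroid` and the other names are invented for illustration): normalize the inputs up front, take the arithmetic mean within each cluster, then re-normalize that mean so the center lands back on the unit sphere.

```scala
// Hypothetical sketch of the spherical k-means centroid update;
// all names here are invented for illustration, not Spark's API.
object SphericalKMeansSketch {

  type Vec = Array[Double]

  def norm(v: Vec): Double = math.sqrt(v.map(x => x * x).sum)

  // Scale a vector to unit length (assumes a nonzero vector).
  def normalize(v: Vec): Vec = {
    val n = norm(v)
    v.map(_ / n)
  }

  // Spherical k-means update: average the (already unit-length) members of
  // a cluster, then re-normalize the mean so the new center lies on the
  // unit sphere. Without this final normalize step, the plain arithmetic
  // mean does not minimize within-cluster cosine distance.
  def sphericalCentroid(points: Seq[Vec]): Vec = {
    val dim = points.head.length
    val sum = new Array[Double](dim)
    for (p <- points; i <- 0 until dim) sum(i) += p(i)
    normalize(sum) // the step that must not be skipped
  }

  def main(args: Array[String]): Unit = {
    // Inputs are normalized up front, as required.
    val cluster = Seq(
      Array(1.0, 0.0),
      Array(1.0, 1.0),
      Array(0.0, 1.0)
    ).map(normalize)
    val center = sphericalCentroid(cluster)
    println(center.mkString("[", ", ", "]")) // a unit-length center
  }
}
```

For unit-length inputs, the re-normalized mean is exactly the center that maximizes within-cluster cosine similarity, which is why both normalization steps matter.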