Github user srowen commented on the issue:
https://github.com/apache/spark/pull/19340
I think you could reasonably define it either way; it depends on whether
you think the cluster center in "k-means" is always defined as the mean,
regardless of distance function, or not.
However, I think I'm more sympathetic now to defining the center as the
point that minimizes intra-cluster distance, which isn't quite the same thing.
In that case, yes, you must normalize the inputs in order for Euclidean
distance and cosine distance to match up.
Yeah, you could tell the user that she can basically choose this behavior or
not by normalizing or not. I now think that's more a potential source of
surprise than a useful choice. So yeah, I'd also support going back and
normalizing the inputs in all cases here when cosine distance is used.
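
To make the normalization point concrete, here is a minimal sketch in plain
Python. It is not the Spark implementation; the helper names and the two
sample points are made up for illustration. It checks that for unit-length
vectors the squared Euclidean distance is twice the cosine distance, and that
the mean of the raw inputs can have a higher total cosine distance to the
points than the mean of the normalized inputs.

```python
import math

# Hypothetical helpers for illustration only; not Spark's KMeans code.

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm(a) * norm(b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def total_cosine_cost(center, vectors):
    # Sum of cosine distances from the center to every point in the cluster.
    return sum(cosine_distance(center, v) for v in vectors)

# On unit vectors, squared Euclidean distance equals 2 * cosine distance.
a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])
identity_holds = math.isclose(squared_euclidean(a, b), 2 * cosine_distance(a, b))

# Two made-up points with very different magnitudes.
points = [[10.0, 0.0], [0.0, 1.0]]

raw_mean = mean(points)                                  # [5.0, 0.5]
normalized_mean = mean([normalize(p) for p in points])   # [0.5, 0.5]

# The raw mean is pulled toward the long vector, so it is a worse center
# under cosine distance than the mean of the normalized points.
raw_cost = total_cosine_cost(raw_mean, points)
normalized_cost = total_cosine_cost(normalized_mean, points)
```

With these sample points, normalized_cost comes out lower than raw_cost,
which is the behavior difference a user would otherwise have to opt into by
normalizing the inputs herself.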