Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19340
  
    I think you could reasonably define it either way; it depends on whether 
you think the cluster center is always defined as the mean (as in "k-means"), 
regardless of the distance function.
    
    However, I think I'm now more sympathetic to defining the center as the 
point that minimizes intra-cluster distance, which isn't quite the same thing. 
In that case, yes, you must normalize the inputs in order for Euclidean distance 
and cosine distance to match up.
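    
    To illustrate why normalization makes the two distances "match up", here 
is a minimal sketch in plain Python (not Spark's implementation; the helper 
names and sample vectors are made up for illustration). For unit-length 
vectors, squared Euclidean distance equals twice the cosine distance, so the 
two agree up to a constant factor. Without normalization they do not.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length (v must be non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cos(angle between a and b); both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical sample vectors with different lengths and directions.
a = [3.0, 4.0]
b = [10.0, 1.0]

# After normalization: ||ua - ub||^2 == 2 * (1 - cos(a, b)).
ua, ub = normalize(a), normalize(b)
normalized_sq_dist = squared_euclidean(ua, ub)
twice_cosine_dist = 2.0 * cosine_distance(a, b)

# On the raw vectors the relationship no longer holds.
raw_sq_dist = squared_euclidean(a, b)
```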
    
    Yeah, you could tell the user that she can basically choose this behavior 
by normalizing the inputs or not. I now think that's more a potential source 
of surprise than a useful choice. So yeah, I'd also support going back and 
normalizing the inputs in all cases here when cosine distance is used.
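    
    As a rough sketch of the difference between the two center definitions 
discussed above (again plain Python, not Spark code; the points and helper 
names are hypothetical): the mean of the raw points is pulled toward 
long vectors, while the mean of the normalized points gives a lower total 
cosine distance to the cluster members.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length (v must be non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cos(angle between a and b); both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def total_cosine_distance(center, points):
    """Sum of cosine distances from center to every point."""
    return sum(cosine_distance(center, p) for p in points)

# Hypothetical cluster: one long vector and one short, orthogonal vector.
points = [[10.0, 0.0], [0.0, 1.0]]

# Center as the mean of the raw inputs (dominated by the long vector).
raw_center = mean(points)

# Center as the mean of the normalized inputs.
normalized_center = mean([normalize(p) for p in points])

raw_cost = total_cosine_distance(raw_center, points)
normalized_cost = total_cosine_distance(normalized_center, points)
```

    In this example the center computed from normalized inputs has the 
smaller total cosine distance, which is the behavior a user of cosine 
distance would most likely expect.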


