GitHub user sethah commented on the issue:

    https://github.com/apache/spark/pull/15450
  
    @srowen I'm not against the change per se; I was just hoping to understand 
how duplicate centers arise. In the case of `initRandom`, sampling with 
replacement makes it possible to select the same initial center more than once, 
but that should be quite unlikely if there are many more unique data points than 
requested centers. Even when this happens, the algorithm should move the centers 
unless they never have any data assigned to them. Since the centers are 
double-valued points in the feature space, when we say duplicate centers, do we 
mean literally duplicate, or that `|c1 - c2|_p < eps` for some norm? 
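
    To make the two questions above concrete, here is a minimal sketch in plain 
Python (not Spark code). It assumes initial centers are drawn uniformly with 
replacement from equally likely unique points, and it separates "literally 
duplicate" from "within `eps` under a p-norm". All function names here are 
hypothetical and exist only for illustration.

```python
import random


def sample_with_replacement(points, k, seed=0):
    """Draw k initial centers uniformly WITH replacement (assumed model of initRandom)."""
    rng = random.Random(seed)
    return [rng.choice(points) for _ in range(k)]


def prob_some_duplicate(n_unique, k):
    """Birthday-problem probability that k draws with replacement from
    n_unique equally likely points contain at least one repeat."""
    prob_all_distinct = 1.0
    for i in range(k):
        prob_all_distinct *= (n_unique - i) / n_unique
    return 1.0 - prob_all_distinct


def p_norm_distance(c1, c2, p=2.0):
    """|c1 - c2|_p for two points given as equal-length sequences of floats."""
    return sum(abs(a - b) ** p for a, b in zip(c1, c2)) ** (1.0 / p)


def exact_duplicate_pairs(centers):
    """Index pairs (i, j), i < j, whose coordinates are literally equal."""
    n = len(centers)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if tuple(centers[i]) == tuple(centers[j])]


def near_duplicate_pairs(centers, eps, p=2.0):
    """Index pairs (i, j), i < j, with |c_i - c_j|_p < eps."""
    n = len(centers)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if p_norm_distance(centers[i], centers[j], p) < eps]
```

    Under this model, `prob_some_duplicate(n_unique, k)` stays small when 
`n_unique` is much larger than `k` and grows quickly as the two get closer, 
which is the "quite unlikely" case described above. The two pair-finding 
helpers show that the answer to "are there duplicates?" can differ depending 
on whether exact equality or an `eps` threshold is meant.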
    
    It seems to me that the problem of duplicate centers would not arise in 
most real-world use cases, but from comments on the JIRA it appears that 
assumption could be false. I think it's easier to assess the change if we 
understand what causes this situation. Do you have any insight? 



