GitHub user srowen commented on the issue:
https://github.com/apache/spark/pull/15450
@sethah I agree that when there are lots of unique points (>> k) then this
is almost certain not to happen, and that covers most real-world use cases, but
the question indeed is what should happen when this is not the case. In that
sense, this change only affects corner cases, so it isn't really a big deal
either way.
Yes, the one case is clear: sampling with replacement when the data set has
< k (unique) points. It will always return k centroids, so it must return
duplicates. In this case, every point will be at distance 0 from some centroid,
so I don't think the centroids can move apart. It stops in 1 iteration with
the degenerate solution, with some centroids assigned 0 points. Not the end of
the world, but not exactly meaningful.
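
To make the degenerate case concrete, here is a minimal sketch in plain Python.
It is not Spark's implementation: the helper `lloyd_step`, the tie-breaking rule
(the lowest-index nearest centroid wins), and the choice to leave an empty
cluster's centroid where it is are all assumptions made for illustration. The
initial centers are also chosen by hand so that they cover both unique points.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_step(points, centers):
    """One assignment + update step of Lloyd's algorithm.

    Assumptions for this sketch: ties go to the lowest-index centroid,
    and a centroid with no assigned points stays where it is.
    Returns (new_centers, cluster_sizes).
    """
    clusters = [[] for _ in centers]
    for p in points:
        # min() returns the first index among equally near centroids.
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)

    new_centers = []
    for center, members in zip(centers, clusters):
        if members:
            dim = len(center)
            mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            new_centers.append(mean)
        else:
            new_centers.append(center)  # empty cluster: centroid does not move
    return new_centers, [len(m) for m in clusters]


# Only 2 unique points, but k = 3, so the initial centers must contain a duplicate.
points = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
centers = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]

new_centers, sizes = lloyd_step(points, centers)
print(sizes)                   # the duplicated centroid ends up with no points
print(new_centers == centers)  # nothing moved, so the iteration has converged
```

Under these assumptions the duplicated centroid receives no points and no
centroid moves, so the run stops after the first iteration.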
The more interesting case is k-means ||. Of course, again, if there are < k
unique points to start, in this case as well, returning k centroids means
returning duplicates. Same argument there -- seems to be no value in returning
k centroids.
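
One way to picture the alternative is to drop duplicate candidates and return
at most as many centroids as there are distinct points. The function name
`distinct_centers` is hypothetical and this is not the code in the PR; it is
only a sketch of the idea in plain Python.

```python
def distinct_centers(candidates, k):
    """Return up to k centers with exact duplicates removed.

    Hypothetical helper for illustration; keeps first-seen order.
    """
    seen = set()
    result = []
    for c in candidates:
        if len(result) >= k:
            break
        if c not in seen:
            seen.add(c)
            result.append(c)
    return result


# Four candidates, but only two distinct points, and k = 3.
candidates = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
centers = distinct_centers(candidates, k=3)
print(len(centers))  # fewer than k, because only 2 distinct candidates exist
```

The caller then has to accept a model with fewer than k clusters, which is the
trade-off under discussion.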
This is really the sum of the argument to me, regardless of what Derrick's
case is.
A twist: it's possible, but quite improbable, for k-means || to choose
fewer than k unique centroids when there are >= k distinct points. This is
most likely when there are barely more than k distinct points. In that case
it's possible that duplicated centroids do get pulled apart and do end up doing
something meaningful. I am arguing this case is not worth dealing with because
it's rare and it doesn't meaningfully harm the quality of the resulting
clustering, but that point is arguable.
I am about 7/10 in favor of the change, certainly the bit about sampling
without replacement, but the rest I could drop if there's any significant
objection to it.
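
For the random-initialization part, the difference between the two sampling
modes can be sketched with Python's standard library. This only illustrates
the sampling behavior, not Spark's `takeSample`-based code path; the variable
names and the decision to cap the sample size at the number of unique points
are assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

unique_points = [(0.0, 0.0), (1.0, 1.0)]  # fewer unique points than k
k = 3

# With replacement: always k centers, so duplicates are unavoidable here.
with_replacement = rng.choices(unique_points, k=k)

# Without replacement: cannot exceed the population size, so cap at it.
without_replacement = rng.sample(unique_points, min(k, len(unique_points)))

print(len(with_replacement), len(set(with_replacement)))
print(len(without_replacement), len(set(without_replacement)))
```

Sampling without replacement from the unique points can never produce a
duplicate, at the cost of possibly returning fewer than k centers.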