[GitHub] spark issue #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return duplica...

srowen Mon, 17 Oct 2016 03:51:08 -0700

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/15450
  
    @sethah I agree that when there are many more unique points than k (>> k), 
duplicate centroids are almost certain not to occur, and that covers most 
real-world use cases. The question is what should happen when that is not the 
case. In that sense, this change only affects corner cases, so it isn't really 
a big deal either way.
    
    Yes, the one case is clear: sampling with replacement when the data set has 
< k (unique) points. It will always return k centroids, so it must return 
duplicates. In this case, every point will be at distance 0 from some centroid, 
so I don't think the centroids can move apart. It stops after 1 iteration with 
the degenerate solution, with some centroids assigned 0 points. Not the end of 
the world, but not exactly meaningful.
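
    To make the degenerate case concrete, here is a minimal pure-Python sketch 
of one Lloyd iteration. It is not Spark's implementation: the names `assign` 
and `lloyd_step` are hypothetical, the initial draw is hard-coded to one 
possible with-replacement sample that covers both unique points, and leaving 
an empty cluster's centroid in place is an assumption.

```python
def assign(points, centroids):
    """Index of the nearest centroid for each point (ties go to the lowest index)."""
    def sqdist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points]


def lloyd_step(points, centroids):
    """One Lloyd iteration: assign points, then recompute each centroid as a mean."""
    labels = assign(points, centroids)
    updated = []
    for j, c in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            updated.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(c))))
        else:
            # Assumption: an empty cluster keeps its centroid where it is.
            updated.append(c)
    return updated, labels


# Four points but only 2 unique values, with k = 3.
points = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
k = 3
# Hard-coded stand-in for one with-replacement draw; it necessarily has a duplicate.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]

new_centroids, labels = lloyd_step(points, centroids)
empty = [j for j in range(k) if j not in labels]
```

    In this run the centroids do not move (`new_centroids == centroids`) and 
the duplicated centroid ends up in `empty` with 0 points assigned.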
    
    The more interesting case is k-means ||. Of course, again, if there are < k 
unique points to start, then in this case as well, returning k centroids means 
returning duplicates. The same argument applies: there seems to be no value in 
returning k centroids when some of them are duplicates.
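
    As a rough illustration of the alternative, returning only the distinct 
centroids could look like the sketch below. `distinct_centers` is a 
hypothetical helper written for this example, not a function from Spark, and 
it assumes exact equality is the right notion of "duplicate".

```python
def distinct_centers(centers):
    """Drop exact duplicates while keeping first-seen order."""
    seen = set()
    unique = []
    for c in centers:
        key = tuple(c)  # tuples are hashable, so they can go in a set
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# k = 3 initial centers chosen from a data set with only 2 unique points.
initial = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
centers = distinct_centers(initial)
```

    Here `centers` has 2 entries rather than 3, so the caller gets fewer than 
k centroids instead of a duplicate that can never be assigned any points.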
    
    This is really the whole of the argument for me, regardless of what 
Derrick's case is.
    
    A twist: it's possible, but quite improbable, for k-means || to choose 
fewer than k unique centroids when there are >= k distinct points. This is 
most likely when there are barely more than k distinct points. In that case 
it's possible that duplicated centroids do get pulled apart and end up doing 
something meaningful. I am arguing that this case is not worth dealing with, 
because it's rare and dropping the duplicates doesn't meaningfully harm the 
quality of the resulting clustering, though that point is arguable.
    
    I am about 7/10 in favor of the change, certainly the bit about sampling 
without replacement, but the rest I could drop if there's any significant 
objection to it.




