Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/1293#issuecomment-47961640
  
    The problem lies in `initKMeansParallel`, the implementation of the
k-means|| algorithm. Since it selects at most as many centers as there are
data points, when `LocalKMeans.kMeansPlusPlus` is called at the end of
`initKMeansParallel` with a cluster count larger than the data size,
`kMeansPlusPlus` throws this exception.
    
    I could slightly modify `kMeansPlusPlus` to avoid this exception by
reusing already-chosen centers to fill the gap between the number of clusters
and the number of data points. But this approach may not be appropriate,
because the problem does not lie in that algorithm itself.
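    To make the padding idea concrete, here is a minimal sketch in plain
Scala (a hypothetical helper, not the actual `LocalKMeans` code; the name
`padCenters` is an assumption for illustration):

```scala
object PadCentersSketch {
  // Hypothetical sketch: if the k-means|| init step produced fewer candidate
  // centers than k, pad the result by cycling through the already-chosen
  // centers instead of throwing an exception.
  def padCenters(chosen: Seq[Array[Double]], k: Int): Seq[Array[Double]] = {
    if (chosen.isEmpty || chosen.size >= k) {
      chosen.take(k) // enough (or no) centers: nothing to pad
    } else {
      // cycle through chosen centers to fill the remaining k - chosen.size slots
      val pad = Iterator.continually(chosen).flatten.take(k - chosen.size).toSeq
      chosen ++ pad
    }
  }
}
```

    The downside, as noted above, is that the "extra" centers are duplicates,
so the caller silently gets fewer than k distinct clusters.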
    
    I have also considered whether it is worth checking for this case by
scanning all the data. Since that scan only counts elements and involves no
other computation, it might still be acceptable. In fact, there are already
many map operations over the data later in the clustering; compared with
those map ops, `data.count()` should be relatively lightweight. Or is it
unnecessary to check at all? Any suggestions?
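    The check I have in mind would look something like this sketch (the
helper name `checkClusterCount` is hypothetical; in the real code the count
would come from `data.count()` on the input RDD):

```scala
object CheckClusterCountSketch {
  // Hypothetical up-front guard: fail fast with a clear message when the
  // requested number of clusters exceeds the number of data points, instead
  // of letting kMeansPlusPlus throw later during initialization.
  def checkClusterCount(dataCount: Long, k: Int): Unit = {
    require(
      dataCount >= k,
      s"Number of clusters ($k) exceeds number of data points ($dataCount)")
  }
}
```

    The cost is one extra pass over the data, but since it is a pure count it
should be cheap next to the clustering iterations themselves.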

