Github user viirya commented on the pull request:
https://github.com/apache/spark/pull/1293#issuecomment-47961640
The problem lies in `initKMeansParallel`, the implementation of the k-means||
algorithm. Since it can select at most as many centers as there are data points,
`LocalKMeans.kMeansPlusPlus`, which is called at the end of
`initKMeansParallel`, throws this exception when the requested number of
clusters exceeds the number of points.
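For reference, a minimal way to reproduce it (a sketch only; it assumes a
`SparkContext` named `sc`, and the three points and k = 5 are arbitrary):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Three points but five requested clusters: initKMeansParallel can gather
// at most three centers, so kMeansPlusPlus fails at the end of initialization.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(1.0, 1.0),
  Vectors.dense(2.0, 2.0)))
KMeans.train(data, 5, 20)  // throws inside LocalKMeans.kMeansPlusPlus
```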
I could slightly modify `kMeansPlusPlus` to avoid this exception by reusing
already-chosen centers to fill the gap between the number of clusters and the
number of data points. But that approach might not be appropriate, because the
problem is not in that algorithm itself.
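To make the idea concrete, a hypothetical sketch of that padding (the
`padCenters` helper is my illustration, not code from this PR):

```scala
// Sketch: when fewer centers than k are available, cycle through the
// already-chosen ones to fill the remaining slots.
def padCenters(chosen: Array[Array[Double]], k: Int): Array[Array[Double]] = {
  require(chosen.nonEmpty, "at least one center must have been chosen")
  if (chosen.length >= k) chosen.take(k)
  else Array.tabulate(k)(i => chosen(i % chosen.length))
}
```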
I have also considered whether it is worth checking for this up front by
scanning all the data. Since that scan only counts the points and involves no
other computation, it might still be acceptable. In fact, the later clustering
stages already run many map operations over the data; compared with those,
`data.count()` should be relatively lightweight (see the sketch below). Or is
it unnecessary to check at all? Any suggestions?
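The check I have in mind would be roughly this (a sketch only; it assumes
`data: RDD[Vector]` and the requested cluster count `k` are in scope):

```scala
// One extra pass over the data, but it only counts points; no distance
// computation is involved.
val numPoints = data.count()
require(numPoints >= k,
  s"Requested $k clusters but the data set has only $numPoints points.")
```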