Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/5267#issuecomment-114652642
Had an offline discussion with @yu-iskw:
1. How to decide which cluster to split?
The current implementation uses the per-coordinate variances and their
norm, which is not a very common criterion. We could either follow the
original paper and use entropy, or use the average distance to the
center. I prefer the latter for simplicity (see the first sketch after
this list).
2. How to initialize cluster centers after each split?
The current implementation adds noise to the current center to make two
initial points. This risks producing an empty cluster in the split,
which is called `failure` in the code. However, as long as the cluster
contains more than one point, we can always choose initial centers such
that the split won't fail. For example, we can pick the current center
and then the farthest point in the cluster (see the second sketch after
this list). Or we can follow `k-means++` and sample a point based on
its distance to the current center.
3. When to stop?
Users can specify the max number of clusters. In a follow-up PR, we can
add more stopping criteria.
4. How to make predictions?
The current implementation finds the closest leaf center. However, that
costs about `k/2` distance computations on a complete binary tree, where
`k` is the number of clusters. If we take the top-down approach, it is
`log(k)` (see the third sketch after this list).
5. Split this PR into smaller ones:
a. Update this PR to address 1), 2), and 3), without the Python API.
b. A follow-up PR for 4).
c. Model save/load.
d. Python API.
e. User guide.
6. For the name, I would recommend `BisectingKMeans`, which is more
accurate.
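
For 1), here is a minimal sketch of the average-distance criterion in
plain Scala; the `Cluster` type and the helper names are illustrative,
not the PR's actual API:

```scala
// Sketch of 1): split the cluster whose points have the largest
// average distance to their center.
case class Cluster(center: Array[Double], points: Seq[Array[Double]])

def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def averageDistance(c: Cluster): Double =
  c.points.map(p => distance(p, c.center)).sum / c.points.size

// The next cluster to bisect is the one with the largest average
// distance among the current leaves.
def nextToSplit(clusters: Seq[Cluster]): Cluster =
  clusters.maxBy(averageDistance)
```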
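
For 2), a sketch of the non-failing initialization suggested above,
reusing `Cluster` and `distance` from the previous sketch:

```scala
// Sketch of 2): seed the two child centers with the current center and
// the point farthest from it, instead of adding random noise.
def initialChildCenters(c: Cluster): (Array[Double], Array[Double]) = {
  val farthest = c.points.maxBy(p => distance(p, c.center))
  (c.center, farthest)
}
```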
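
For 4), a sketch of the top-down prediction, again reusing `distance`;
the `Node` hierarchy is illustrative, not the PR's model class:

```scala
// Sketch of 4): descend the binary tree of centers, at each internal
// node following the child whose center is closer to the point. This
// takes O(log k) distance computations instead of O(k) over all leaves.
sealed trait Node { def center: Array[Double] }
case class Leaf(center: Array[Double], clusterIndex: Int) extends Node
case class Internal(center: Array[Double], left: Node, right: Node)
  extends Node

def predict(node: Node, point: Array[Double]): Int = node match {
  case Leaf(_, index) => index
  case Internal(_, left, right) =>
    if (distance(point, left.center) <= distance(point, right.center))
      predict(left, point)
    else predict(right, point)
}
```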