Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/5267#issuecomment-114652642
Had an offline discussion with @yu-iskw:
1. How to decide which cluster to split?
The current implementation uses the per-coordinate variances and their
norm, which is not a very common criterion. We could either follow the
original paper and use entropy, or use the average distance to the
center. I prefer the latter for simplicity (see the first sketch after
this list).
2. How to initialize cluster centers after each split?
The current implementation adds noise to the current center to make two
initial points. This risks producing an empty cluster in the split,
which is called `failure` in the code. However, as long as the cluster
contains more than one point, we can always choose initial centers such
that the split won't fail. For example, we can pick the current center
and then the farthest point in the cluster (see the second sketch after
this list). Or we can follow `k-means++` and sample a point based on
its distance to the current center.
3. When to stop?
Users can specify the max number of clusters. In a follow-up PR, we can
add more stopping criteria.
4. How to make predictions?
The current implementation finds the closest leaf center. However, that
costs about `k/2` distance computations on a complete binary tree, where
`k` is the number of clusters. If we take the top-down approach, it is
`log(k)` (see the third sketch after this list).
5. Split this PR into smaller ones:
a. Update this PR to address 1), 2), and 3), without the Python API.
b. A follow-up PR for 4).
c. Model save/load.
d. Python API.
e. User guide.
6. For the name, I would recommend `BisectingKMeans`, which is more
accurate.
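
For 1), here is a minimal sketch of the average-distance criterion in
plain Scala; the `Cluster` type and the helper names are illustrative,
not the PR's actual API:

```scala
// Sketch of 1): split the cluster whose points have the largest
// average distance to their center.
case class Cluster(center: Array[Double], points: Seq[Array[Double]])

def distance(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def averageDistance(c: Cluster): Double =
  c.points.map(p => distance(p, c.center)).sum / c.points.size

// The next cluster to bisect is the one with the largest average
// distance among the current leaves.
def nextToSplit(clusters: Seq[Cluster]): Cluster =
  clusters.maxBy(averageDistance)
```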
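
For 2), a sketch of the non-failing initialization suggested above,
reusing `Cluster` and `distance` from the previous sketch:

```scala
// Sketch of 2): seed the two child centers with the current center and
// the point farthest from it, instead of adding random noise.
def initialChildCenters(c: Cluster): (Array[Double], Array[Double]) = {
  val farthest = c.points.maxBy(p => distance(p, c.center))
  (c.center, farthest)
}
```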
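
For 4), a sketch of the top-down prediction, again reusing `distance`;
the `Node` hierarchy is illustrative, not the PR's model class:

```scala
// Sketch of 4): descend the binary tree of centers, at each internal
// node following the child whose center is closer to the point. This
// takes O(log k) distance computations instead of O(k) over all leaves.
sealed trait Node { def center: Array[Double] }
case class Leaf(center: Array[Double], clusterIndex: Int) extends Node
case class Internal(center: Array[Double], left: Node, right: Node)
  extends Node

def predict(node: Node, point: Array[Double]): Int = node match {
  case Leaf(_, index) => index
  case Internal(_, left, right) =>
    if (distance(point, left.center) <= distance(point, right.center))
      predict(left, point)
    else predict(right, point)
}
```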