Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5267#issuecomment-126527910
  
    @yu-iskw I made one pass, but I think we need more time on this PR. Some 
high-level comments:
    
    1. Documentation is not sufficient. For example, the criterion used to 
split clusters is not documented. We only mention that the implementation is 
based on the paper, but we didn't describe the algorithm nor how it differs 
from the algorithm used in the paper. There are also several public methods 
without JavaDoc.
    2. Many methods should be private rather than package private. If we want 
to test some utility functions, we can put them under a utility class/object 
and mark the object package private but methods under it public. Otherwise, it 
is hard to read the unit tests.
    3. This PR adds many public APIs, especially on tree cluster nodes. We need 
to discuss the names as well as how to keep them minimal.
    4. Correctness. I didn't check the details, but the variance calculation is 
wrong (line 337).
    5. Code style issues.
    
    For 1), could you add more description about the algorithms? If I have some 
time before 1.5, I can send you a PR to fix some issues. But we may miss the 
1.5 release.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to