Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/5267#issuecomment-126527910
@yu-iskw I made one pass, but I think we need more time on this PR. Some
high-level comments:
1. Documentation is not sufficient. For example, the criterion used to
split clusters is not documented. We only mention that the implementation is
based on the paper, but we didn't describe the algorithm nor how it differs
from the algorithm used in the paper. There are also several public methods
without JavaDoc.
2. Many methods should be private rather than package private. If we want
to test some utility functions, we can put them under a utility class/object
and mark the object package private but methods under it public. Otherwise, it
is hard to read the unit tests.
3. This PR adds many public APIs, especially on tree cluster nodes. We need
to discuss the names as well as how to keep them minimal.
4. Correctness. I didn't check the details, but the variance calculation is
wrong (line 337).
5. Code style issues.
For 1), could you add more description about the algorithms? If I have some
time before 1.5, I can send you a PR to fix some issues. But we may miss the
1.5 release.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]