Github user freeman-lab commented on the pull request:
https://github.com/apache/spark/pull/2906#issuecomment-69134341
Hi @yu-iskw and @rnowling , I've spent time reviewing the code and using it
in both Python and Scala. Overall great work, terrific to see my little gist
turned into something so refined and performant! =) I left lots of comments,
most minor, though documenting the caching behavior seems quite important.
The one significant addition I'd suggest is exposing another model output:
a list of the centers at all nodes in the learned tree. This would be in
addition to just the centers of the leaves, which is currently returned by
`getCenters` (or `clusterCenters` in Python). Maybe call it `getTreeCenters`.
It's basically given by `model.clusterTree.toSeq().map(_.center)`. But we
should make sure it's sorted so that it can be indexed using the values from
the merge list. In other words, if `Z` is the merge list, and row i indicates
that `Z[i,0]` and `Z[i,1]` were merged, we want to be able to get the centers
associated with those nodes by calling, for example,
`model.treeCenters[Z[i,0]]` and `model.treeCenters[Z[i,1]]`. What do you think?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]