GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3022#issuecomment-68398244
@tgaloppo @FlytxtRnD I made some JIRAs for the to-do items above.
I'd say the most important are:
* [Change predictMembership() to take an RDD, not the GMM.](https://issues.apache.org/jira/browse/SPARK-5020)
  * I did not notice that it took all of the GMM parameters. It should be renamed and made internal, and a wrapper method predictMembership() should take an RDD only.
* [Make MultivariateGaussian public](https://issues.apache.org/jira/browse/SPARK-5018)
* [Update GMM API to use MultivariateGaussian instead of means, covariances](https://issues.apache.org/jira/browse/SPARK-5019)
* (The Python API and user guide JIRAs from @mengxr should also be in this list.)
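
To make the first item concrete, here is a minimal sketch of the wrapper pattern. It is plain Python rather than the actual Scala/Spark code, and every name in it (`Gaussian1D`, `_soft_assignments`, `MixtureModel`, `predict_membership`) is hypothetical. A 1-D Gaussian stands in for MultivariateGaussian and a plain list stands in for an RDD, so this only illustrates the API shape, not the real implementation.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class Gaussian1D:
    """Hypothetical 1-D stand-in for MultivariateGaussian (assumption)."""
    mean: float
    variance: float

    def pdf(self, x: float) -> float:
        # Density of N(mean, variance) evaluated at x.
        norm = math.sqrt(2.0 * math.pi * self.variance)
        return math.exp(-((x - self.mean) ** 2) / (2.0 * self.variance)) / norm


def _soft_assignments(
    x: float, weights: Sequence[float], gaussians: Sequence[Gaussian1D]
) -> List[float]:
    """Internal helper: takes every model parameter explicitly.

    Returns the normalized membership of x in each component.
    No guard against all densities underflowing to zero (sketch only).
    """
    scores = [w * g.pdf(x) for w, g in zip(weights, gaussians)]
    total = sum(scores)
    return [s / total for s in scores]


class MixtureModel:
    """Hypothetical model object that owns its own parameters."""

    def __init__(self, weights: Sequence[float], gaussians: Sequence[Gaussian1D]):
        self.weights = list(weights)
        self.gaussians = list(gaussians)

    def predict_membership(self, points: Iterable[float]) -> List[List[float]]:
        # Public wrapper: the caller passes only the data, never the parameters.
        return [_soft_assignments(x, self.weights, self.gaussians) for x in points]
```

With this shape, a caller would write something like `model.predict_membership(points)` and never touch the weights or Gaussians directly, which is the point of the rename in SPARK-5020.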
It would be great to do:
* [SVD for Gaussian initialization](https://issues.apache.org/jira/browse/SPARK-5017)
Some less critical ones are:
* [random seed](https://issues.apache.org/jira/browse/SPARK-5015)
* [If numFeatures or k are large, distribute matrix inverses for Gaussian initialization.](https://issues.apache.org/jira/browse/SPARK-5016)
* [Be faster for SparseVector inputs](https://issues.apache.org/jira/browse/SPARK-5021)
I removed the NaN JIRAs, but we should investigate numerical stability at
some point.
Please let me know if you'd like any assigned to you, and thanks in advance
for your work on this! If I'm able to work on one of the JIRAs, I'll make a
note on the JIRA page.