[
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612847#comment-14612847
]
Feynman Liang edited comment on SPARK-5016 at 7/3/15 5:20 AM:
--------------------------------------------------------------
I did some [perf testing|https://gist.github.com/feynmanliang/70d79c23dffc828939ec], and it shows that distributing the Gaussians yields a significant performance improvement when the number of clusters and the dimensionality of the data are sufficiently large (>30 dimensions, >10 clusters).
In particular, the "typical" use case of 40 dimensions and 10k clusters gains about 15 seconds in runtime when distributing the Gaussians.
> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---------------------------------------------------------------------------
>
> Key: SPARK-5016
> URL: https://issues.apache.org/jira/browse/SPARK-5016
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.2.0
> Reporter: Joseph K. Bradley
> Labels: clustering
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse
> computation for Gaussian initialization.
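The idea in this ticket is that the per-cluster d x d covariance inversions are independent of one another, so for large numFeatures (d) and k the O(k * d^3) cost can be spread across workers instead of being done serially on the driver. Below is a minimal Python sketch of that structure, not the MLlib Scala code: `invert_cov` and `distributed_inverses` are hypothetical names, and a thread pool stands in for the Spark cluster (NumPy's LAPACK calls release the GIL, so the inversions genuinely run concurrently).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def invert_cov(sigma):
    # Each cluster's covariance inverse depends only on that cluster,
    # so the k inversions are embarrassingly parallel.
    return np.linalg.inv(sigma)

def distributed_inverses(covs, workers=4):
    # In Spark this map would run over a parallelized collection of
    # per-cluster covariance matrices; a thread pool stands in here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(invert_cov, covs))
```

With k covariance matrices, `distributed_inverses` returns the k inverses in cluster order, which is the property the distributed map must preserve for the EM update.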
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)