[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612847#comment-14612847 ]

Feynman Liang edited comment on SPARK-5016 at 7/3/15 5:20 AM:
--------------------------------------------------------------

I did some [perf testing|https://gist.github.com/feynmanliang/70d79c23dffc828939ec], and it shows that distributing the Gaussians yields a significant performance improvement when the number of clusters and the dimensionality of the data are sufficiently large (>30 dimensions, >10 clusters).

In particular, the "typical" use case of 40 dimensions and 10k clusters saves about 15 seconds of runtime when the Gaussians are distributed.
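A minimal local sketch of the idea being benchmarked, in Python/NumPy rather than the actual MLlib Scala code: each cluster's covariance inverse is independent, so for large numFeatures and k the per-cluster inversions can be shipped to executors instead of looped over on the driver. The function names and sizes below are illustrative, not from the patch; the PySpark form is only indicated in a comment.

```python
import numpy as np

def local_inverses(covs):
    # Serial baseline: invert every cluster covariance on the driver.
    return [np.linalg.inv(c) for c in covs]

def distributed_inverses(covs, sc=None):
    # Sketch: with a real SparkContext this would be
    #   sc.parallelize(covs).map(np.linalg.inv).collect()
    # so each executor inverts a subset of the k covariance matrices.
    # Without a SparkContext we fall back to the serial path so the
    # sketch stays runnable.
    if sc is not None:
        return sc.parallelize(covs).map(np.linalg.inv).collect()
    return [np.linalg.inv(c) for c in covs]

# Illustrative sizes in the regime the comment describes: d > 30, k > 10.
rng = np.random.default_rng(0)
d, k = 40, 12
covs = []
for _ in range(k):
    a = rng.standard_normal((d, d))
    covs.append(a @ a.T + d * np.eye(d))  # well-conditioned SPD covariance

inv_serial = local_inverses(covs)
inv_dist = distributed_inverses(covs)
```

The crossover point matters because shipping k small matrices to executors has fixed overhead; only once k * d^2 is large enough (roughly the >30 dimensions, >10 clusters threshold above) does the parallel inversion win.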



> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5016
>                 URL: https://issues.apache.org/jira/browse/SPARK-5016
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>              Labels: clustering
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse 
> computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
