[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323217#comment-14323217 ]

Joseph K. Bradley commented on SPARK-5016:
------------------------------------------

Hi all (back with Internet now),

What I had in mind was parallelizing only the construction of 
MultivariateGaussian, which has an expensive initialization (doing SVD), in 
these two places:
* [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L152]
* [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L177]

Computing means and covariances can remain as is.

I'd be OK with a heuristic choice of whether to do the inverses on the driver.  
E.g., something trivial like numFeatures <= 10 && k <= 10 can be done on the 
driver, and everything else gets distributed.  I'd vote against making it 
another parameter, but we could add that later on if users need to adjust the 
threshold.
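To make the idea concrete, here is a rough sketch of what I mean, not an actual patch. The names `runOnDriver` and `makeGaussians` are illustrative, and the `10 && 10` cutoff is just the trivial heuristic mentioned above; the real code in GaussianMixture.scala would plug this in where it currently builds the Gaussians on the driver:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

// Trivial heuristic: small problems stay on the driver.
def runOnDriver(numFeatures: Int, k: Int): Boolean =
  numFeatures <= 10 && k <= 10

// Sketch: construct the k component Gaussians, distributing the
// expensive SVD-based initialization inside the MultivariateGaussian
// constructor when the problem looks large. `means` and `covs` are the
// per-component parameters already computed (that part stays as is).
def makeGaussians(
    sc: SparkContext,
    means: Array[Vector],
    covs: Array[Matrix]): Array[MultivariateGaussian] = {
  val params = means.zip(covs)
  if (runOnDriver(means.head.size, params.length)) {
    params.map { case (mu, sigma) => new MultivariateGaussian(mu, sigma) }
  } else {
    // One task per component; each task pays the SVD cost on an
    // executor, and the finished distributions come back to the driver.
    sc.parallelize(params, params.length)
      .map { case (mu, sigma) => new MultivariateGaussian(mu, sigma) }
      .collect()
  }
}
```

If the threshold ever needs tuning, it would be easy to swap `runOnDriver` for a user-settable parameter later without changing the surrounding code.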

> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5016
>                 URL: https://issues.apache.org/jira/browse/SPARK-5016
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse 
> computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
