[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323217#comment-14323217 ]
Joseph K. Bradley commented on SPARK-5016:
------------------------------------------

Hi all, (back with Internet now)

What I had in mind was parallelizing only the construction of MultivariateGaussian, which has an expensive initialization (doing SVD), in these two places:
* [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L152]
* [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L177]

Computing means and covariances can remain as is.

I'd be OK with a heuristic choice of whether to do the inverses on the driver. E.g., something trivial like numFeatures <= 10 && k <= 10 can be done on the driver, and everything else gets distributed. I'd vote against making it another parameter, but we could add that later on if users need to adjust the threshold.

> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5016
>                 URL: https://issues.apache.org/jira/browse/SPARK-5016
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse
> computation for Gaussian initialization.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
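For concreteness, the heuristic described above could be sketched as below. This is an illustrative sketch only, not Spark's actual code: the object and method names are hypothetical, and the (10, 10) threshold is the trivial choice suggested in the comment. The real change would branch at the two GaussianMixture.scala lines linked above, constructing the MultivariateGaussians (each doing an SVD-based covariance inverse) locally on the driver when both dimensions are small, and via something like sc.parallelize(...).map(...).collect() otherwise.

```scala
// Hypothetical sketch of the placement heuristic from the comment.
// Only the decision logic is shown; wiring it into
// GaussianMixture.run (driver loop vs. sc.parallelize) is omitted.
object GaussianInitPlacement {

  // Trivial thresholds suggested in the comment; hard-coded rather
  // than exposed as a user parameter, per the discussion above.
  val maxDriverFeatures = 10
  val maxDriverK = 10

  /** True if the Gaussian construction (expensive SVD-based
    * covariance inverse per component) should stay on the driver. */
  def computeOnDriver(numFeatures: Int, k: Int): Boolean =
    numFeatures <= maxDriverFeatures && k <= maxDriverK

  def main(args: Array[String]): Unit = {
    println(computeOnDriver(5, 5))    // small problem: driver
    println(computeOnDriver(100, 20)) // large problem: distribute
  }
}
```

Keeping the threshold internal (rather than a public parameter) matches the comment's suggestion: it can be promoted to a tunable later if users turn out to need it.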