[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312269#comment-14312269 ]
Travis Galoppo commented on SPARK-5016:
---------------------------------------

The k Gaussians are updated with code that currently looks like this:

{code}
var i = 0
while (i < k) {
  val mu = sums.means(i) / sums.weights(i)
  // rank-1 symmetric update: sigma := sigma - weight * mu * mu^T
  BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu).asInstanceOf[DenseVector],
    Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
  weights(i) = sums.weights(i) / sumWeights
  gaussians(i) = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
  i = i + 1
}
{code}

The matrix inversion (in reality a partial inversion; the inverse is never explicitly calculated) happens while constructing the MultivariateGaussian objects. This code could be parallelized along these lines:

{code}
val (ws, gs) = sc.parallelize(0 until k).map { i =>
  val mu = sums.means(i) / sums.weights(i)
  BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu).asInstanceOf[DenseVector],
    Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
  val weight = sums.weights(i) / sumWeights
  val gaussian = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
  (weight, gaussian)
}.collect().unzip

(0 until k).foreach { i =>
  weights(i) = ws(i)
  gaussians(i) = gs(i)
}
{code}

This effectively distributes the computation of the k MultivariateGaussians (and their weights).

As for the threshold values for k / numFeatures: the right cutoffs are probably a function of cluster size and interconnect speed, so these thresholds should be optional parameters to GaussianMixture. Personally, I would vote for the default behavior being to not perform this parallelization, and to let the user decide when the time is right to enable it (see the sketch after the quoted issue below).


> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5016
>                 URL: https://issues.apache.org/jira/browse/SPARK-5016
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse
> computation for Gaussian initialization.
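To make the opt-in behavior concrete, here is a minimal, self-contained sketch of how user-settable thresholds on k and numFeatures could be exposed, with the parallel path disabled by default. This is not the existing MLlib API; the class, setter names, and default values below are assumptions made only to illustrate the shape of the configuration.

{code}
// Standalone sketch only -- not the MLlib GaussianMixture implementation.
// The parallel per-Gaussian update is opt-in (off by default) and is used
// only when both k and numFeatures reach user-settable thresholds.
class GaussianMixtureConfig {
  private var distributeGaussianUpdates: Boolean = false // default: keep the local while loop
  private var minK: Int = 25             // hypothetical threshold on k
  private var minNumFeatures: Int = 25   // hypothetical threshold on numFeatures

  /** Opt in to distributing the per-Gaussian update / partial inversion. */
  def setDistributeGaussianUpdates(enable: Boolean): this.type = {
    distributeGaussianUpdates = enable
    this
  }

  /** Set the minimum problem size at which distribution is considered worthwhile. */
  def setDistributionThresholds(k: Int, numFeatures: Int): this.type = {
    minK = k
    minNumFeatures = numFeatures
    this
  }

  /** True only if the user opted in and the problem is large enough. */
  def shouldDistribute(k: Int, numFeatures: Int): Boolean =
    distributeGaussianUpdates && k >= minK && numFeatures >= minNumFeatures
}
{code}

The update step would then simply branch on shouldDistribute(k, numFeatures): run the existing while loop when it returns false, and the sc.parallelize version when it returns true.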