[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312269#comment-14312269 ]

Travis Galoppo commented on SPARK-5016:
---------------------------------------

The k Gaussians are currently updated with code that looks like this:

{code}
      var i = 0
      while (i < k) {
        val mu = sums.means(i) / sums.weights(i)
        BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu).asInstanceOf[DenseVector],
          Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
        weights(i) = sums.weights(i) / sumWeights
        gaussians(i) = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
        i = i + 1
      }
{code}

... the matrix inversion (in reality only a partial inversion; the inverse is never explicitly computed) occurs during construction of the MultivariateGaussian objects. This code could be parallelized along the lines of:

{code}
      val (ws, gs) = sc.parallelize(0 until k).map { i =>
        val mu = sums.means(i) / sums.weights(i)
        BLAS.syr(-sums.weights(i), Vectors.fromBreeze(mu).asInstanceOf[DenseVector],
          Matrices.fromBreeze(sums.sigmas(i)).asInstanceOf[DenseMatrix])
        val weight = sums.weights(i) / sumWeights
        val gaussian = new MultivariateGaussian(mu, sums.sigmas(i) / sums.weights(i))
        (weight, gaussian)
      }.collect().unzip
      
      (0 until k).foreach{ i =>
        weights(i) = ws(i)
        gaussians(i) = gs(i)
      }
{code}

... effectively distributing the computation of the k MultivariateGaussians (and their weights) across the cluster.

As for the threshold values for k / numFeatures: the right cutoffs are probably a function of cluster size and interconnect speed (the data shipped per component scales with numFeatures^2, since each covariance is numFeatures x numFeatures). These thresholds should probably be optional parameters to GaussianMixture. Personally, I would vote for the default behavior to be no parallelization, and let the user decide when the time is right to enable it; a rough sketch of such a parameter follows.
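
A minimal sketch of what an opt-in flag could look like, following the builder-style setters GaussianMixture already uses; the class name GaussianMixtureSketch and the setter setDistributeGaussianUpdates are hypothetical, not current MLlib API:

{code}
// Hypothetical sketch only: neither this class name nor the setter exists in MLlib today.
class GaussianMixtureSketch {

  // Off by default: keep the current driver-side while-loop unless the caller opts in.
  private var distributeGaussianUpdates: Boolean = false

  def setDistributeGaussianUpdates(distribute: Boolean): this.type = {
    this.distributeGaussianUpdates = distribute
    this
  }

  def getDistributeGaussianUpdates: Boolean = distributeGaussianUpdates

  // The EM update step would then branch on the flag:
  //   if (distributeGaussianUpdates) { /* sc.parallelize(0 until k) ... as above */ }
  //   else { /* existing while (i < k) loop */ }
}
{code}

Usage would match the other setters, e.g. new GaussianMixture().setDistributeGaussianUpdates(true) (again, hypothetical), so nothing changes for existing callers unless they opt in.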


> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5016
>                 URL: https://issues.apache.org/jira/browse/SPARK-5016
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse 
> computation for Gaussian initialization.


