[GitHub] spark pull request #15413: [SPARK-17847][ML] Reduce shuffled data size of Ga...

jkbradley Wed, 28 Dec 2016 14:59:11 -0800

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15413#discussion_r94084884
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala ---
    @@ -356,13 +427,243 @@ class GaussianMixture @Since("2.0.0") (
       override def transformSchema(schema: StructType): StructType = {
         validateAndTransformSchema(schema)
       }
    +
    +  /**
    +   * Initialize weights and corresponding gaussian distributions at random.
    +   *
    +   * We start with uniform weights, a random mean from the data, and diagonal covariance matrices
    +   * using component variances derived from the samples.
    +   *
    +   * @param instances The training instances.
    +   * @param numClusters The number of clusters.
    +   * @param numFeatures The number of features of training instance.
    +   * @return The initialized weights and corresponding gaussian distributions. Note the
    +   *         covariance matrix of multivariate gaussian distribution is symmetric and
    +   *         we only save the upper triangular part as a dense vector.
    +   */
    +  private def initRandom(
    +      instances: RDD[Vector],
    +      numClusters: Int,
    +      numFeatures: Int): (Array[Double], Array[(DenseVector, DenseVector)]) = {
    +    val samples = instances.takeSample(withReplacement = true, numClusters * numSamples, $(seed))
    +    val weights: Array[Double] = Array.fill(numClusters)(1.0 / numClusters)
    +    val gaussians: Array[(DenseVector, DenseVector)] = Array.tabulate(numClusters) { i =>
    +      val slice = samples.view(i * numSamples, (i + 1) * numSamples)
    +      val mean = {
    +        val v = new DenseVector(new Array[Double](numFeatures))
    +        var i = 0
    +        while (i < numSamples) {
    +          BLAS.axpy(1.0, slice(i), v)
    +          i += 1
    +        }
    +        BLAS.scal(1.0 / numSamples, v)
    +        v
    +      }
    +      /*
    +         Construct matrix where diagonal entries are element-wise
    +         variance of input vectors (computes biased variance).
    +         Since the covariance matrix of multivariate gaussian distribution is symmetric,
    +         only the upper triangular part of the matrix will be saved as a dense vector
    +         in order to reduce the shuffled data size.
    +       */
    +      val cov = {
    +        val ss = new DenseVector(new Array[Double](numFeatures)).asBreeze
    +        slice.foreach(xi => ss += (xi.asBreeze - mean.asBreeze) :^ 2.0)
    +        val diagVec = Vectors.fromBreeze(ss)
    +        BLAS.scal(1.0 / numSamples, diagVec)
    +        val covVec = new DenseVector(Array.fill[Double](
    +          numFeatures * (numFeatures + 1) / 2)(0.0))
    +        diagVec.toArray.zipWithIndex.foreach { case (v: Double, i: Int) =>
    +          covVec.values(i + i * (i + 1) / 2) = v
    +        }
    +        covVec
    +      }
    +      (mean, cov)
    +    }
    +    (weights, gaussians)
    +  }
     }
     
     @Since("2.0.0")
     object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {
     
       @Since("2.0.0")
       override def load(path: String): GaussianMixture = super.load(path)
    +
    +  /**
    +   * Heuristic to distribute the computation of the [[MultivariateGaussian]]s, approximately when
    +   * numFeatures > 25 except for when numClusters is very small.
    +   *
    +   * @param numClusters  Number of clusters
    +   * @param numFeatures  Number of features
    +   */
    +  private[clustering] def shouldDistributeGaussians(
    +      numClusters: Int,
    +      numFeatures: Int): Boolean = {
    +    ((numClusters - 1.0) / numClusters) * numFeatures > 25.0
    +  }
    +
    +  /**
    +   * Convert an n * (n + 1) / 2 dimension array representing the upper triangular part of a matrix
    +   * into an n * n array representing the full symmetric matrix.
    +   *
    +   * @param n The order of the n by n matrix.
    +   * @param triangularValues The upper triangular part of the matrix packed in an array
    +   *                         (column major).
    +   * @return An array which represents the symmetric matrix in column major.
    +   */
    +  private[clustering] def unpackUpperTriangularMatrix(
    --- End diff --
    
    You always convert the result to a DenseMatrix right away, so how about
    returning a DenseMatrix directly?
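
    For illustration only, a minimal sketch of the unpacking step, assuming the
    packed layout described in the Scaladoc above (upper triangle, column major).
    The names `PackedSymmetric`, `packedIndex` and `unpackUpperTriangular` are
    hypothetical and are not taken from the PR. The sketch returns a plain array
    so that it runs without Spark; the comment near the end marks where a
    DenseMatrix could be built instead.

```scala
object PackedSymmetric {
  /** Position of entry (i, j), with i <= j, in a column-major packed upper triangle. */
  def packedIndex(i: Int, j: Int): Int = i + j * (j + 1) / 2

  /**
   * Expand the packed upper triangle (n * (n + 1) / 2 values, column major)
   * into the n * n column-major values of the full symmetric matrix.
   */
  def unpackUpperTriangular(n: Int, triangularValues: Array[Double]): Array[Double] = {
    require(triangularValues.length == n * (n + 1) / 2, "unexpected packed length")
    val full = new Array[Double](n * n)
    var j = 0
    while (j < n) {
      var i = 0
      while (i <= j) {
        val v = triangularValues(packedIndex(i, j))
        full(i + j * n) = v // entry (i, j)
        full(j + i * n) = v // mirrored entry (j, i)
        i += 1
      }
      j += 1
    }
    // Assumption: with Spark on the classpath, this could instead return
    // new org.apache.spark.ml.linalg.DenseMatrix(n, n, full)
    full
  }
}
```

    Returning the matrix type directly would let callers skip the wrapping step
    at each call site; the diagonal positions used here (i + i * (i + 1) / 2)
    match the ones written in initRandom above.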



