[GitHub] spark pull request: [SPARK-14533] [MLLIB] RowMatrix.computeCovaria...

mengxr Mon, 11 Apr 2016 10:16:32 -0700

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12299#discussion_r59243115
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 
---
    @@ -328,43 +328,43 @@ class RowMatrix @Since("1.0.0") (
         val n = numCols().toInt
         checkNumColumns(n)
     
    -    val (m, mean) = rows.treeAggregate[(Long, BDV[Double])]((0L, 
BDV.zeros[Double](n)))(
    -      seqOp = (s: (Long, BDV[Double]), v: Vector) => (s._1 + 1L, s._2 += 
v.toBreeze),
    -      combOp = (s1: (Long, BDV[Double]), s2: (Long, BDV[Double])) =>
    -        (s1._1 + s2._1, s1._2 += s2._2)
    -    )
    -
    -    if (m <= 1) {
    -      sys.error(s"RowMatrix.computeCovariance called on matrix with only 
$m rows." +
    -        "  Cannot compute the covariance of a RowMatrix with <= 1 row.")
    -    }
    -    updateNumRows(m)
    -
    -    mean :/= m.toDouble
    -
    -    // We use the formula Cov(X, Y) = E[X * Y] - E[X] E[Y], which is not 
accurate if E[X * Y] is
    -    // large but Cov(X, Y) is small, but it is good for sparse computation.
    -    // TODO: find a fast and stable way for sparse data.
    +    val summary = computeColumnSummaryStatistics()
    +    val m = summary.count
    +    require(m > 1, s"RowMatrix.computeCovariance called on matrix with 
only $m rows." +
    +      "  Cannot compute the covariance of a RowMatrix with <= 1 row.")
    +    val mean = summary.mean.toBreeze
    +
    +    rows.first() match {
    +      case _: SparseVector =>
    +        // We use the formula Cov(X, Y) = E[X * Y] - E[X] E[Y], which is 
not accurate if E[X * Y] is
    +        // large but Cov(X, Y) is small, but it is good for sparse 
computation.
    +        // TODO: find a fast and stable way for sparse data.
    +        val G = computeGramianMatrix().toBreeze
    +        var i = 0
    +        var j = 0
    +        val m1 = m - 1.0
    +        var alpha = 0.0
    +        while (i < n) {
    +          alpha = m / m1 * mean(i)
    +          j = i
    +          while (j < n) {
    +            val Gij = G(i, j) / m1 - alpha * mean(j)
    +            G(i, j) = Gij
    +            G(j, i) = Gij
    +            j += 1
    +          }
    +          i += 1
    +        }
    +        Matrices.fromBreeze(G)
     
    -    val G = computeGramianMatrix().toBreeze.asInstanceOf[BDM[Double]]
    +      case _: DenseVector =>
    +        // For dense, go ahead and subtract off mean to avoid round-off 
problem above
    +        val centeredRows = rows.map(row => Vectors.fromBreeze(row.toBreeze 
- mean))
    --- End diff --
    
    This creates one temp vector per row, which could be expensive.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-14533] [MLLIB] RowMatrix.computeCovaria...

Reply via email to