[GitHub] spark pull request #17459: [SPARK-20109][MLlib] Added toBlockMatrixDense to ...

johnc1231 Sun, 02 Apr 2017 14:32:24 -0700

Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109320642
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +    */
    +  def toBlockMatrixDense(): BlockMatrix = {
    +    toBlockMatrixDense(1024, 1024)
    +  }
    +
    +  /**
    +    * Converts to BlockMatrix. Creates blocks of `DenseMatrix`.
    +    * @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
    +    *                     a smaller value. Must be an integer value greater than 0.
    +    * @return a [[BlockMatrix]]
    +    */
    +  def toBlockMatrixDense(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks: RDD[((Int, Int), Matrix)] = rows.flatMap({ ir =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector.toArray
    +        .grouped(colsPerBlock)
    +        .zipWithIndex
    +        .map({ case (values, blockColumn) =>
    +          ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values))
    +        })
    +    }).groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rowsPerBlock, colsPerBlock)).map({
    --- End diff --
    
    You're right. My code assumes that there is a single block per partition,
    which is incorrect. Thanks for that.
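
A minimal sketch of the point above, using plain Scala collections in place of the RDD so it runs without Spark. It is not the code from this pull request: `BlockSketch`, `blockSize`, and `toDenseBlocks` are hypothetical names, and the column-major layout is an assumption made for illustration. The idea it shows is that each block's dimensions are derived from the block's own `(blockRow, blockColumn)` key, so the result does not depend on how many blocks end up in one partition.

```scala
object BlockSketch {
  // Hypothetical helper: length of block `blockIndex` along a dimension of
  // total length `total`. Only the last block can be shorter than `perBlock`.
  def blockSize(total: Long, perBlock: Int, blockIndex: Int): Int =
    math.min(perBlock.toLong, total - blockIndex.toLong * perBlock).toInt

  // rows: (row index, dense row values). Returns block coordinates mapped to
  // column-major block values (layout assumed for this sketch).
  def toDenseBlocks(
      rows: Seq[(Long, Array[Double])],
      numRows: Long,
      numCols: Int,
      rowsPerBlock: Int,
      colsPerBlock: Int): Map[(Int, Int), Array[Double]] = {
    require(rowsPerBlock > 0 && colsPerBlock > 0)

    // Same keying as the diff: split each row into column chunks and tag each
    // chunk with its block coordinates and its row offset inside the block.
    val pieces = rows.flatMap { case (index, values) =>
      val blockRow = (index / rowsPerBlock).toInt
      val rowInBlock = (index % rowsPerBlock).toInt
      values.grouped(colsPerBlock).zipWithIndex.map { case (chunk, blockColumn) =>
        ((blockRow, blockColumn), (rowInBlock, chunk))
      }
    }

    // Group by block key, then size each block from its key alone.
    pieces.groupBy(_._1).map { case ((blockRow, blockColumn), entries) =>
      val r = blockSize(numRows, rowsPerBlock, blockRow)
      val c = blockSize(numCols.toLong, colsPerBlock, blockColumn)
      val data = new Array[Double](r * c)
      entries.foreach { case (_, (rowInBlock, chunk)) =>
        chunk.indices.foreach { j => data(j * r + rowInBlock) = chunk(j) }
      }
      ((blockRow, blockColumn), data)
    }
  }
}
```

Computing the size from the index also covers the case where the row or column count is an exact multiple of the block size: there `m % rowsPerBlock` is 0, while `blockSize` still returns the full `rowsPerBlock` for the last block.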



