Github user johnc1231 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17459#discussion_r109320642

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.scala ---
    @@ -113,6 +114,67 @@ class IndexedRowMatrix @Since("1.0.0") (
       }
     
       /**
    +   * Converts to BlockMatrix. Creates blocks of `DenseMatrix` with size 1024 x 1024.
    +   */
    +  def toBlockMatrixDense(): BlockMatrix = {
    +    toBlockMatrixDense(1024, 1024)
    +  }
    +
    +  /**
    +   * Converts to BlockMatrix. Creates blocks of `DenseMatrix`.
    +   * @param rowsPerBlock The number of rows of each block. The blocks at the bottom edge may have
    +   *                     a smaller value. Must be an integer value greater than 0.
    +   * @param colsPerBlock The number of columns of each block. The blocks at the right edge may have
    +   *                     a smaller value. Must be an integer value greater than 0.
    +   * @return a [[BlockMatrix]]
    +   */
    +  def toBlockMatrixDense(rowsPerBlock: Int, colsPerBlock: Int): BlockMatrix = {
    +    require(rowsPerBlock > 0,
    +      s"rowsPerBlock needs to be greater than 0. rowsPerBlock: $rowsPerBlock")
    +    require(colsPerBlock > 0,
    +      s"colsPerBlock needs to be greater than 0. colsPerBlock: $colsPerBlock")
    +
    +    val m = numRows()
    +    val n = numCols()
    +    val lastRowBlockIndex = m / rowsPerBlock
    +    val lastColBlockIndex = n / colsPerBlock
    +    val lastRowBlockSize = (m % rowsPerBlock).toInt
    +    val lastColBlockSize = (n % colsPerBlock).toInt
    +    val numRowBlocks = math.ceil(m.toDouble / rowsPerBlock).toInt
    +    val numColBlocks = math.ceil(n.toDouble / colsPerBlock).toInt
    +
    +    val blocks: RDD[((Int, Int), Matrix)] = rows.flatMap({ ir =>
    +      val blockRow = ir.index / rowsPerBlock
    +      val rowInBlock = ir.index % rowsPerBlock
    +
    +      ir.vector.toArray
    +        .grouped(colsPerBlock)
    +        .zipWithIndex
    +        .map({ case (values, blockColumn) =>
    +          ((blockRow.toInt, blockColumn), (rowInBlock.toInt, values))
    +        })
    +    }).groupByKey(GridPartitioner(numRowBlocks, numColBlocks, rowsPerBlock, colsPerBlock)).map({
    --- End diff --

You're right.
My code makes the assumption that there is a single block per partition, which is incorrect. Thanks for that.
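
To illustrate the point, here is a minimal plain-Scala sketch (no Spark) of why blocks have to be assembled per key rather than per partition. Everything in it is an assumption made for illustration: `BlocksPerPartitionSketch`, `slices`, `partitionOf` and `assemble` are hypothetical names, and `partitionOf` is a simple stand-in, not Spark's `GridPartitioner`. It is not the code from this pull request.

```scala
object BlocksPerPartitionSketch {
  type BlockKey = (Int, Int)            // (blockRow, blockColumn)
  type RowSlice = (Int, Array[Double])  // (rowInBlock, values)

  // Split each (index, values) row into per-block slices, mirroring the flatMap in the diff.
  def slices(rows: Seq[(Long, Array[Double])],
             rowsPerBlock: Int,
             colsPerBlock: Int): Seq[(BlockKey, RowSlice)] =
    rows.flatMap { case (index, values) =>
      val blockRow = (index / rowsPerBlock).toInt
      val rowInBlock = (index % rowsPerBlock).toInt
      values.grouped(colsPerBlock).zipWithIndex.map { case (chunk, blockColumn) =>
        ((blockRow, blockColumn), (rowInBlock, chunk))
      }.toList
    }

  // Hypothetical partition function (an assumption, NOT Spark's GridPartitioner):
  // with fewer partitions than blocks, several keys map to the same partition.
  def partitionOf(key: BlockKey, numColBlocks: Int, numPartitions: Int): Int =
    (key._1 * numColBlocks + key._2) % numPartitions

  // Assemble one column-major dense block from the slices that share a key.
  def assemble(blockRows: Int, blockCols: Int, parts: Seq[RowSlice]): Array[Double] = {
    val data = new Array[Double](blockRows * blockCols)
    parts.foreach { case (rowInBlock, chunk) =>
      chunk.zipWithIndex.foreach { case (value, col) =>
        data(col * blockRows + rowInBlock) = value
      }
    }
    data
  }

  def main(args: Array[String]): Unit = {
    // A 4 x 4 matrix split into 2 x 2 blocks gives four block keys.
    val rows = Seq.tabulate(4)(i => (i.toLong, Array.tabulate(4)(j => (i * 4 + j).toDouble)))
    val keyed = slices(rows, rowsPerBlock = 2, colsPerBlock = 2)

    // Two partitions for four blocks: each partition receives more than one key.
    val byPartition = keyed.groupBy { case (key, _) =>
      partitionOf(key, numColBlocks = 2, numPartitions = 2)
    }
    byPartition.toSeq.sortBy(_._1).foreach { case (pid, entries) =>
      // Group by key inside the partition, then build one block per key.
      val blocks = entries.groupBy(_._1).map { case (key, kv) =>
        key -> assemble(2, 2, kv.map(_._2))
      }
      println(s"partition $pid holds ${blocks.size} blocks: ${blocks.keys.toSeq.sorted.mkString(", ")}")
    }
  }
}
```

With four block keys and only two partitions in this toy setup, each partition ends up holding two blocks, so any assembly step that treats a whole partition as one block would mix rows from different blocks.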