[SYSTEMML-947] Remove binary block classes from MLContext

Remove BinaryBlockMatrix and BinaryBlockFrame classes from MLContext API and
incorporate similar functionality into Matrix and Frame classes.
Closes #531.

Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/c44f6c02
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/c44f6c02
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/c44f6c02

Branch: refs/heads/gh-pages
Commit: c44f6c0224fbbabb3804f91b00c039546f1dabaf
Parents: 49bd822
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Wed Jun 7 10:28:44 2017 -0700
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Wed Jun 7 10:28:44 2017 -0700

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/c44f6c02/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index c424c70..ddccde1 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -243,7 +243,7 @@ mean: Double = 0.49996223966662934

 Many different types of input and output variables are automatically allowed. These types include
 `Boolean`, `Long`, `Double`, `String`, `Array[Array[Double]]`, `RDD<String>` and `JavaRDD<String>`
-in `CSV` (dense) and `IJV` (sparse) formats, `DataFrame`, `BinaryBlockMatrix`, `Matrix`, and
+in `CSV` (dense) and `IJV` (sparse) formats, `DataFrame`, `Matrix`, and
 `Frame`. RDDs and JavaRDDs are assumed to be CSV format unless MatrixMetadata is supplied indicating
 IJV format.

@@ -1606,11 +1606,7 @@ Therefore, if you use a set of data multiple times, one way to potentially impro
 to convert it to a SystemML matrix representation and then use this representation rather than
 performing the data conversion each time.

-There are currently two mechanisms for this in SystemML: **(1) BinaryBlockMatrix** and **(2) Matrix**.
-
-**BinaryBlockMatrix:**
-
-If you have an input DataFrame, it can be converted to a BinaryBlockMatrix, and this BinaryBlockMatrix
+If you have an input DataFrame, it can be converted to a Matrix, and this Matrix
 can be passed as an input rather than passing in the DataFrame as an input.

 For example, suppose we had a 10000x100 matrix represented as a DataFrame, as we saw in an earlier example.
@@ -1633,10 +1629,10 @@ val minMaxMeanScript = dml(minMaxMean).in("Xin", df, mm).out("minOut", "maxOut",
 {% endhighlight %}

 Rather than passing in a DataFrame each time to the Script object creation, let's instead create a
-BinaryBlockMatrix object based on the DataFrame and pass this BinaryBlockMatrix to the Script object
+Matrix object based on the DataFrame and pass this Matrix to the Script object
 creation. If we run the code below in the Spark Shell, we see that the data conversion step occurs
-when the BinaryBlockMatrix object is created. However, when we create a Script object twice, we see
-that no conversion penalty occurs, since this conversion occurred when the BinaryBlockMatrix was
+when the Matrix object is created. However, when we create a Script object twice, we see
+that no conversion penalty occurs, since this conversion occurred when the Matrix was
 created.

 {% highlight scala %}
@@ -1649,14 +1645,11 @@ val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCol
 val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
 val df = spark.createDataFrame(data, schema)
 val mm = new MatrixMetadata(numRows, numCols)
-val bbm = new BinaryBlockMatrix(df, mm)
-val minMaxMeanScript = dml(minMaxMean).in("Xin", bbm).out("minOut", "maxOut", "meanOut")
-val minMaxMeanScript = dml(minMaxMean).in("Xin", bbm).out("minOut", "maxOut", "meanOut")
+val matrix = new Matrix(df, mm)
+val minMaxMeanScript = dml(minMaxMean).in("Xin", matrix).out("minOut", "maxOut", "meanOut")
+val minMaxMeanScript = dml(minMaxMean).in("Xin", matrix).out("minOut", "maxOut", "meanOut")
 {% endhighlight %}

-
-**Matrix:**
-
 When a matrix is returned as an output, it is returned as a Matrix object, which is a wrapper around
 a SystemML MatrixObject. As a result, an output Matrix is already in a SystemML representation,
 meaning that it can be passed as an input with no data conversion penalty.