Hi Niketan/Deron,

Thanks for the inputs. Let me dig a little deeper using these inputs.
I'll get back to you in case I have more questions.
Regards,
Sourav

On Mon, Dec 7, 2015 at 4:28 PM, Deron Eriksson <[email protected]> wrote:
> Thank you, Niketan, for providing such useful information. The
> RDDConverterUtilsExt javadoc example is great.
>
> The MLContext API has a tremendous amount of potential given that it has
> such clean integration with Spark (for example, it's so easy to create an
> MLContext from a SparkContext in the Spark shell). I'm really interested
> in seeing how data scientists and developers embrace it in the coming
> months.
>
> Deron
>
> On Mon, Dec 7, 2015 at 3:31 PM, Niketan Pansare <[email protected]>
> wrote:
> > Thanks, Deron, for your response :)
> >
> > Sourav: A few additional comments:
> >
> > 1. MLContext allows users to pass RDDs to SystemML, and MLOutput allows
> > them to fetch the result RDDs after the execution of a DML script.
> >
> > 2. MLContext exposes the registerInput("variableName", RDD) interface,
> > while MLOutput has get...("variableName") methods, e.g. getDF,
> > getBinaryBlockedRDD, ...
> >
> > 3. With the exception of DataFrame, the RDDs supported by these classes
> > mirror the RDDs in the symbol table and the formats supported by the
> > read()/write() built-in functions. The following types of RDDs are
> > supported by these classes:
> > a. Binary-blocked RDD (JavaPairRDD<MatrixIndexes, MatrixBlock>) =>
> > corresponds to format="binary"
> > b. String-based RDD (JavaRDD<String>) => corresponds to format="csv"
> > or format="text"
> > c. DataFrame
> >
> > See
> > http://apache.github.io/incubator-systemml/dml-language-reference.html#readwrite-built-in-functions
> > for more details about the formats supported by the read()/write()
> > built-in functions.
> >
> > 4. For all other types of RDDs, we decided to expose them through
> > converter utils:
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtils.java
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java
> >
> > 5. The utility functions in RDDConverterUtilsExt are not yet tested for
> > performance and robustness. Once they are tested, they will be moved
> > into RDDConverterUtils. Most of these utils have javadocs within the
> > code, and we will add both a usage guide and external javadoc for them.
> > The following types of conversions are supported by the converter utils:
> > a. CoordinateMatrix to binary-blocked RDD (see
> > coordinateMatrixToBinaryBlock in RDDConverterUtilsExt).
> > b. Binary-blocked RDD to String RDD.
> > c. DataFrame with a Vector UDT column to binary blocks and vice versa.
> > This is useful while working with RDD<LabeledPoint>. (See
> > vectorDataFrameToBinaryBlock and binaryBlockToVectorDataFrame in
> > RDDConverterUtilsExt.)
> > d. DataFrame with double columns (see dataFrameToBinaryBlock in
> > RDDConverterUtilsExt). Since a DataFrame/RDD is a collection, not an
> > indexed/ordered sequence (at least not at the API level), an ID column
> > is inserted by MLOutput to denote the row index.
> > e. Binary blocks to labeled points (see binaryBlockToLabeledPoints in
> > RDDConverterUtils).
> > f. Conversions between text/cell/csv formats and binary-blocked RDDs
> > (see RDDConverterUtils).
> >
> > 6. The MLContext interface is Scala compatible, i.e. we support both
> > JavaRDD and RDD, JavaSparkContext and SparkContext, java.util.HashMap
> > and scala.collection.immutable.Map, and so on.
> >
> > 7. 
> > MatrixCharacteristics is used to provide the metadata (such as the
> > number of rows, number of columns, block row length, block column
> > length, and number of non-zeros) of an RDD to SystemML's optimizer. In
> > some cases it is required (for example: text, binary formats), while in
> > other cases it can be skipped (for example: csv, dataframe). MLContext
> > exposes convenient wrappers such as
> > void registerInput(String varName,
> >   JavaPairRDD<MatrixIndexes, MatrixBlock> rdd,
> >   long rlen, long clen, int brlen, int bclen)
> > to avoid creating a MatrixCharacteristics. Here is the source code if
> > you are interested:
> > https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/matrix/MatrixCharacteristics.java
> >
> > A good example of using MatrixCharacteristics and the converter utils
> > is provided in RDDConverterUtilsExt's javadoc:
> >
> > import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> > import org.apache.sysml.runtime.matrix.MatrixCharacteristics
> > import org.apache.spark.api.java.JavaSparkContext
> > import org.apache.spark.mllib.linalg.distributed.MatrixEntry
> > import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
> > val matRDD = sc.textFile("ratings.text").map(_.split(" ")).map(x =>
> >   new MatrixEntry(x(0).toLong, x(1).toLong, x(2).toDouble))
> >   .filter(_.value != 0).cache
> > require(matRDD.filter(x => x.i == 0 || x.j == 0).count == 0,
> >   "Expected 1-based ratings file")
> > val nnz = matRDD.count
> > val numRows = matRDD.map(_.i).max
> > val numCols = matRDD.map(_.j).max
> > val coordinateMatrix = new CoordinateMatrix(matRDD, numRows, numCols)
> > val mc = new MatrixCharacteristics(numRows, numCols, 1000, 1000, nnz)
> > val binBlocks = RDDConverterUtilsExt.coordinateMatrixToBinaryBlock(
> >   new JavaSparkContext(sc), coordinateMatrix, mc, true)
> >
> > Thanks,
> >
> > Niketan Pansare
> > IBM Almaden Research Center
> > E-mail: npansar At us.ibm.com
> > http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> >
> > From: Deron Eriksson <[email protected]>
> > To: [email protected]
> > Date: 12/07/2015 02:50 PM
> > Subject: Re: API documentation for SystemML
> > ------------------------------
> >
> > Hi Sourav,
> >
> > One way to generate Javadocs for the entire SystemML project is
> > "mvn javadoc:javadoc".
> >
> > Unfortunately, classes such as MatrixCharacteristics and
> > RDDConverterUtils currently have very minimal API documentation. We are
> > hoping to address this in the near future. However, you may find that
> > the following documentation link could be of assistance in getting
> > started, given your interest in Scala:
> > http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html
> >
> > Deron
> >
> > On Mon, Dec 7, 2015 at 1:58 PM, Sourav Mazumder
> > <[email protected]> wrote:
> > > Hi,
> > >
> > > Is there any Scala/Java API documentation available for classes like
> > > MatrixCharacteristics and RDDConverterUtils?
> > >
> > > What I need to understand is which helper utilities are available and
> > > the details of their signatures/APIs.
> > >
> > > Regards,
> > > Sourav
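The registerInput/MLOutput flow described in the thread can be sketched end to end as below. This is a minimal spark-shell sketch, not code from the thread: it assumes the SystemML jar is on the classpath, reuses `binBlocks`, `numRows`, and `numCols` from the javadoc example quoted above, and the DML script name "factorize.dml", the variable names V/W/H, and the argument map are hypothetical.

```scala
// Hypothetical spark-shell session; assumes SystemML on the classpath
// and binBlocks/numRows/numCols from the javadoc example above.
import org.apache.sysml.api.{MLContext, MLOutput}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Create an MLContext from the shell's SparkContext (point 6: Scala
// SparkContext is accepted directly).
val ml = new MLContext(sc)

// Register a binary-blocked input with its metadata (rlen, clen,
// brlen, bclen), as in the wrapper described in point 7.
ml.registerInput("V", binBlocks, numRows, numCols, 1000, 1000)
ml.registerOutput("W")
ml.registerOutput("H")

// Execute the DML script; the script path and arguments are made up
// for illustration. MLOutput holds the result RDDs (point 1).
val out: MLOutput = ml.execute("factorize.dml", Map("maxiter" -> "100"))

// Fetch results via the get...("variableName") methods from point 2.
val wDF = out.getDF(sqlContext, "W")
val hBlocks = out.getBinaryBlockedRDD("H")
```

Note that with a DataFrame input the metadata arguments can be skipped (point 7), since the dimensions can be inferred.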

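Point 5c (a DataFrame with a Vector UDT column to binary blocks, useful for RDD<LabeledPoint>) can be sketched as below. The input data is made up, and the exact parameter list of vectorDataFrameToBinaryBlock is an assumption modeled on coordinateMatrixToBinaryBlock from the javadoc example; check RDDConverterUtilsExt's javadoc for the authoritative signature.

```scala
// Hypothetical sketch: RDD[LabeledPoint] -> Vector-UDT DataFrame ->
// SystemML binary blocks (point 5c). Parameter order of
// vectorDataFrameToBinaryBlock is assumed, not confirmed by the thread.
import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
import org.apache.sysml.runtime.matrix.MatrixCharacteristics
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Toy 2x2 feature matrix wrapped in LabeledPoints.
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(3.0, 4.0))))

// toDF yields a DataFrame whose "features" column is a Vector UDT.
val df = points.toDF

// Metadata for the optimizer: 2x2 matrix, 1000x1000 blocks.
val mc = new MatrixCharacteristics(2, 2, 1000, 1000)

val featureBlocks = RDDConverterUtilsExt.vectorDataFrameToBinaryBlock(
  new JavaSparkContext(sc), df, mc, false, "features")
```

The labels would be registered separately (for example via dataFrameToBinaryBlock on the "label" column, point 5d).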