Hi Sourav,
Your understanding is correct: X and Y can be supplied either as files or
as RDDs/DataFrames. Each of these two mechanisms has its own benefits. The
former (passing a file) pushes the reading/reblocking into the optimizer,
while the latter allows you to preprocess the data first (for example,
using Spark SQL).
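For instance, here is a minimal sketch of such preprocessing. The raw
DataFrame xRawDF, the temp table and the column names are all hypothetical;
the point is only that arbitrary Spark SQL transformations can be applied
before the data is handed to SystemML:
> val ml = new MLContext(sc)
> // xRawDF is a hypothetical DataFrame of raw features
> xRawDF.registerTempTable("x_raw")
> // Drop rows with missing values before registering the input
> val xDF = sqlContext.sql("SELECT c1, c2, c3 FROM x_raw WHERE c1 IS NOT NULL")
> ml.registerInput("X", xDF)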
Two use cases where X and Y are supplied as files on HDFS:
1. Command-line invocation: $SPARK_HOME/bin/spark-submit --master ....
SystemML.jar -f GLM.dml -nvargs dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.001
tol=0.00000001 disp=1.0 moi=100 mii=10 X=INPUT_DIR/X Y=INPUT_DIR/Y
B=OUTPUT_DIR/betas fmt=csv O=OUTPUT_DIR/stats Log=OUTPUT_DIR/log
2. Using MLContext, but without registering X and Y as inputs. Instead, we
pass the file names as command-line parameters:
> val ml = new MLContext(sc)
> val cmdLineParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y",
"dfam" -> "2", "link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)
As mentioned earlier, X and Y can also be provided as RDDs/DataFrames:
> val ml = new MLContext(sc)
> ml.registerInput("X", xDF)
> ml.registerInput("Y", yDF)
> val cmdLineParams = Map("X" -> " ", "Y" -> " ", "B" -> " ", "dfam" -> "2",
"link" -> "2", ...)
> ml.execute("GLM.dml", cmdLineParams)
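If you would rather read the betas back as a DataFrame than write them to a
file, the output can be registered as well. A minimal sketch, assuming the
matrix variable that GLM.dml writes to $B is named beta_out (please verify
this against the write statement in the script):
> ml.registerOutput("beta_out")
> val output = ml.execute("GLM.dml", cmdLineParams)
> val betasDF = output.getDF(sqlContext, "beta_out")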
One important thing that I must point out is the concept of "ifdef". It is
explained in the Command-Line Arguments section of the DML language
reference:
http://apache.github.io/incubator-systemml/dml-language-reference.html#command-line-arguments
Here is a snippet from the DML script for GLM:
https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/GLM.dml
fileX = $X;
fileY = $Y;
fileO = ifdef ($O, " ");
fmtB = ifdef ($fmt, "text");
distribution_type = ifdef ($dfam, 1);
The above DML code essentially says that $X and $Y are required parameters
(a design decision made by the author of the GLM script), whereas $O, $fmt
and $dfam are optional, since they are assigned default values when not
explicitly provided. Both constructs are important tools in the arsenal of
a DML script writer. By not guarding a dollar parameter with ifdef, the
script writer ensures that the user has to provide its value (in this case,
file names for X and Y). This is why you will notice that I have provided a
space for X, Y and B in the second MLContext snippet: the required-parameter
check is satisfied, while the actual data comes from the registered inputs
rather than from files.
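A consequence of ifdef is that the optional parameters can simply be left
out of the parameter map, in which case their defaults kick in. A minimal
sketch (paths are placeholders; only the unguarded parameters $X, $Y and $B
are supplied):
> // $dfam, $link, $fmt, etc. are omitted; ifdef assigns their defaults in GLM.dml
> val minimalParams = Map("X" -> "INPUT_DIR/X", "Y" -> "INPUT_DIR/Y", "B" -> "OUTPUT_DIR/betas")
> ml.execute("GLM.dml", minimalParams)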
Thanks,
Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
From: Sourav Mazumder <[email protected]>
To: [email protected]
Date: 12/07/2015 07:30 PM
Subject: Using GLM with Spark
Hi,
Trying to use GLM with Spark.
I went through its documentation at
http://apache.github.io/incubator-systemml/algorithms-regression.html#generalized-linear-models
I see that inputs like X and Y have to be supplied using a file, and the
file has to be in HDFS.
Is this understanding correct? Can't X and Y be supplied using a DataFrame
from a Spark context (as in the LinearRegression example in
http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html#train-using-systemml-linear-regression-algorithm
)?
Regards,
Sourav