[ https://issues.apache.org/jira/browse/SYSTEMML-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871170#comment-15871170 ]
Niketan Pansare commented on SYSTEMML-1238:
-------------------------------------------

I am able to reproduce this bug with the command line as well. Here is the output of GLM-predict (after running LinRegDS):
{code}
$ cat y_predicted.csv
189.09660701586185
133.3260601238074
157.3739106185465
132.8144037303023
135.88434209133283
154.81562865102103
194.2131709509127
136.3959984848379
125.13955782772601
137.41931127184807
178.35182275225503
123.60458864721075
152.7690030770007
141.0009060263837
116.95305553164462
161.46716176658717
144.58250078091928
144.58250078091928
170.67697684967874
117.4647119251497
{code}
Here is the output of Python mllearn:
{code}
>>> import numpy as np
>>> from pyspark.context import SparkContext
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import HashingTF, Tokenizer
>>> from pyspark.sql import SparkSession
>>> from sklearn import datasets, metrics, neighbors
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from systemml.mllearn import LinearRegression, LogisticRegression, NaiveBayes, SVM
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
>>> sparkSession = SparkSession.builder.getOrCreate()
>>> regr = LinearRegression(sparkSession, solver="direct-solve")
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Welcome to Apache SystemML!
17/02/16 22:39:21 WARN RewriteRemovePersistentReadWrite: Non-registered persistent write of variable 'X' (line 87).
17/02/16 22:39:21 WARN RewriteRemovePersistentReadWrite: Non-registered persistent write of variable 'y' (line 88).
BEGIN LINEAR REGRESSION SCRIPT
Reading X and Y...
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y,153.36255924170615
STDEV_TOT_Y,77.21853383600028
AVG_RES_Y,4.8020565933360324E-14
STDEV_RES_Y,67.06389890324985
DISPERSION,4497.566536105316
PLAIN_R2,0.24750834362605834
ADJUSTED_R2,0.24571669682516795
PLAIN_R2_NOBIAS,0.24750834362605834
ADJUSTED_R2_NOBIAS,0.24571669682516795
Writing the output matrix...
END LINEAR REGRESSION SCRIPT
lr
>>> regr.predict(diabetes_X_test)
17/02/16 22:39:35 WARN Expression: WARNING: null -- line 149, column 4 -- Read input file does not exist on FS (local mode):
17/02/16 22:39:35 WARN Expression: Metadata file: .mtd not provided
array([[ 188.84521284],
       [ 134.98127765],
       [ 158.20701117],
       [ 134.4871131 ],
       [ 137.45210036],
       [ 155.73618846],
       [ 193.78685827],
       [ 137.94626491],
       [ 127.07464496],
       [ 138.93459399],
       [ 178.46775744],
       [ 125.59215133],
       [ 153.75953028],
       [ 142.39374579],
       [ 119.16801227],
       [ 162.16032752],
       [ 145.8528976 ],
       [ 145.8528976 ],
       [ 171.05528929],
       [ 119.66217681]])
{code}
To reproduce the command-line output, first dump the test data to CSV:
{code}
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
diabetes_X_test.tofile('X_test.csv', sep="\n")
diabetes_X.tofile('X.csv', sep="\n")
diabetes.target.tofile('y.csv', sep="\n")
{code}
Then execute the following commands (you may have to edit the DML script to add the format, or create a metadata file):
{code}
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SystemML.jar -f LinearRegDS.dml -nvargs X=X.csv Y=y.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SystemML.jar -f GLM-predict.dml -nvargs X=X_test.csv M=y_predicted.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
{code}

> Python test failing for LinearRegCG
> -----------------------------------
>
>                 Key: SYSTEMML-1238
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1238
>             Project: SystemML
>          Issue Type: Bug
>          Components: Algorithms, APIs
>    Affects Versions: SystemML 0.13
>            Reporter: Imran Younus
>            Assignee: Niketan Pansare
>         Attachments: python_LinearReg_test_spark.1.6.log, python_LinearReg_test_spark.2.1.log
>
>
> [~deron] discovered that one of the Python tests ({{test_mllearn_df.py}}) with Spark 2.1.0 was failing because the test score from linear regression was very low ({{~0.24}}). I did some investigation, and it turns out that the model parameters computed by the DML script are incorrect. In SystemML 0.12, the values of the betas from the linear regression model are {{\[152.919, 938.237\]}}, which is what we expect from the normal equation (I also verified this with sklearn). But the values of the betas from SystemML 0.13 (with Spark 2.1.0) come out to be {{\[153.146, 458.489\]}}. These are not correct, and therefore the test score is much lower than expected. The data going into the DML script is correct: I printed out the values of {{X}} and {{Y}} in DML and didn't see any issue there.
> Attached are the log files for the two tests (SystemML 0.12 and 0.13) with the explain flag.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
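A side note on the {{Metadata file: .mtd not provided}} warning seen in the mllearn session above: SystemML looks for a JSON descriptor named {{<filename>.mtd}} next to each CSV input. This is a sketch of such a file, based on SystemML's documented metadata format; the row count assumes the 20-row diabetes test split from the snippet above.

```shell
# Sketch of a metadata file for X_test.csv (20 rows, 1 feature column).
# SystemML reads <filename>.mtd as a JSON descriptor for CSV inputs.
cat > X_test.csv.mtd <<'EOF'
{
    "data_type": "matrix",
    "value_type": "double",
    "rows": 20,
    "cols": 1,
    "format": "csv",
    "header": false
}
EOF
```

An analogous descriptor would be needed for {{X.csv}} (442 rows) and {{y.csv}} if the DML scripts are run without the {{fmt=csv}} argument.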
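As a sanity check on the betas quoted in the issue description, the expected coefficients can be recomputed directly from the normal equation with plain NumPy; nothing here is SystemML-specific, and the variable names are mine.

```python
import numpy as np
from sklearn import datasets

# Same train split as in the session above: all but the last 20 rows,
# using only feature column 2 (BMI) of the diabetes dataset.
diabetes = datasets.load_diabetes()
X = diabetes.data[:-20, np.newaxis, 2]
y = diabetes.target[:-20]

# Normal equation: beta = (A^T A)^-1 A^T y, with a leading column of
# ones so that beta[0] is the intercept and beta[1] the slope.
A = np.hstack([np.ones((X.shape[0], 1)), X])
beta = np.linalg.solve(A.T @ A, A.T @ y)
print(beta)  # roughly [152.919, 938.237], matching SystemML 0.12
```

Any solution close to {{\[153.146, 458.489\]}} (the SystemML 0.13 result) is therefore clearly wrong for this data.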