[ https://issues.apache.org/jira/browse/SYSTEMML-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871170#comment-15871170 ]

Niketan Pansare edited comment on SYSTEMML-1238 at 2/17/17 5:33 AM:
--------------------------------------------------------------------

I am able to reproduce this behavior (not yet sure whether it is a bug) from the command line as well.
Here is the output of GLM-predict (after running LinearRegDS):
{code}
$ cat y_predicted.csv
189.09660701586185
133.3260601238074
157.3739106185465
132.8144037303023
135.88434209133283
154.81562865102103
194.2131709509127
136.3959984848379
125.13955782772601
137.41931127184807
178.35182275225503
123.60458864721075
152.7690030770007
141.0009060263837
116.95305553164462
161.46716176658717
144.58250078091928
144.58250078091928
170.67697684967874
117.4647119251497
{code}

Here is the output of Python mllearn:
{code}
>>> import numpy as np
>>> from pyspark.context import SparkContext
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import HashingTF, Tokenizer
>>> from pyspark.sql import SparkSession
>>> from sklearn import datasets, metrics, neighbors
>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>>
>>> from systemml.mllearn import LinearRegression, LogisticRegression, NaiveBayes, SVM
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X = diabetes.data[:, np.newaxis, 2]
>>> diabetes_X_train = diabetes_X[:-20]
>>> diabetes_X_test = diabetes_X[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]
>>> sparkSession = SparkSession.builder.getOrCreate()
>>> regr = LinearRegression(sparkSession, solver="direct-solve")
>>> regr.fit(diabetes_X_train, diabetes_y_train)

Welcome to Apache SystemML!

17/02/16 22:39:21 WARN RewriteRemovePersistentReadWrite: Non-registered 
persistent write of variable 'X' (line 87).
17/02/16 22:39:21 WARN RewriteRemovePersistentReadWrite: Non-registered 
persistent write of variable 'y' (line 88).
BEGIN LINEAR REGRESSION SCRIPT
Reading X and Y...
Calling the Direct Solver...
Computing the statistics...
AVG_TOT_Y,153.36255924170615
STDEV_TOT_Y,77.21853383600028
AVG_RES_Y,4.8020565933360324E-14
STDEV_RES_Y,67.06389890324985
DISPERSION,4497.566536105316
PLAIN_R2,0.24750834362605834
ADJUSTED_R2,0.24571669682516795
PLAIN_R2_NOBIAS,0.24750834362605834
ADJUSTED_R2_NOBIAS,0.24571669682516795
Writing the output matrix...
END LINEAR REGRESSION SCRIPT
lr
>>> regr.predict(diabetes_X_test)
17/02/16 22:39:35 WARN Expression: WARNING: null -- line 149, column 4 -- Read 
input file does not exist on FS (local mode):
17/02/16 22:39:35 WARN Expression: Metadata file:  .mtd not provided
array([[ 188.84521284],
       [ 134.98127765],
       [ 158.20701117],
       [ 134.4871131 ],
       [ 137.45210036],
       [ 155.73618846],
       [ 193.78685827],
       [ 137.94626491],
       [ 127.07464496],
       [ 138.93459399],
       [ 178.46775744],
       [ 125.59215133],
       [ 153.75953028],
       [ 142.39374579],
       [ 119.16801227],
       [ 162.16032752],
       [ 145.8528976 ],
       [ 145.8528976 ],
       [ 171.05528929],
       [ 119.66217681]])
{code}
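The two prediction vectors are close but not identical (189.0966 vs. 188.8452 in the first row, for example). To quantify the gap, the outputs can be diffed with numpy; this is a sketch that assumes the command-line {{y_predicted.csv}} is in the working directory and that the array returned by {{regr.predict(diabetes_X_test)}} above has been stored in a variable {{preds}}:
{code}
import numpy as np

# Command-line GLM-predict output: one prediction per line
cli = np.loadtxt('y_predicted.csv')

# mllearn predictions, flattened from (20, 1) to (20,)
py = preds.ravel()

# Maximum absolute element-wise difference between the two runs
print(np.abs(cli - py).max())
{code}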

To reproduce the command-line output, first dump the test data to CSV:
{code}
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
diabetes_X_test.tofile('X_test.csv', sep="\n")
diabetes_X.tofile('X.csv', sep="\n")
diabetes.target.tofile('y.csv', sep="\n")
{code}
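Note that {{tofile(..., sep="\n")}} writes the values as plain text, one per line, so each file is a headerless single-column CSV.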

Then execute the following commands (you may have to edit the DML script to add the format, or create metadata files; see the sketch after the commands):
{code}
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SystemML.jar -f LinearRegDS.dml -nvargs X=X.csv Y=y.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
~/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SystemML.jar -f GLM-predict.dml -nvargs X=X_test.csv M=y_predicted.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
{code}
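SystemML looks for a JSON metadata file next to each CSV input (e.g. {{X.csv.mtd}} for {{X.csv}}). As a sketch, a minimal metadata file for the 442-row, single-column {{X.csv}} dumped above could be created like this; adjust {{rows}} for the other files ({{X_test.csv}} has 20 rows):
{code}
$ cat > X.csv.mtd << 'EOF'
{
    "data_type": "matrix",
    "value_type": "double",
    "rows": 442,
    "cols": 1,
    "format": "csv",
    "header": false,
    "sep": ","
}
EOF
{code}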

I also tested using SystemML 0.12.0 and got the same predictions:
{code}
$ ~/spark-1.6.1-bin-hadoop2.6/bin/spark-submit systemml-0.12.0-incubating.jar -f LinearRegDS.dml -nvargs X=X.csv Y=y.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
$ ~/spark-1.6.1-bin-hadoop2.6/bin/spark-submit systemml-0.12.0-incubating.jar -f GLM-predict.dml -nvargs X=X_test.csv M=y_predicted.csv B=B.csv fmt=csv icpt=1 tol=0.000001 reg=1
$ cat y_predicted.csv
189.09660701586185
133.3260601238074
157.3739106185465
132.8144037303023
135.88434209133283
154.81562865102103
194.2131709509127
136.3959984848379
125.13955782772601
137.41931127184807
178.35182275225503
123.60458864721075
152.7690030770007
141.0009060263837
116.95305553164462
161.46716176658717
144.58250078091928
144.58250078091928
170.67697684967874
117.4647119251497
{code}
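For reference, the betas expected from the normal equation (the {{\[152.919, 938.237\]}} cited in the issue description below) can be recomputed with plain numpy, independent of SystemML. A minimal sketch:
{code}
import numpy as np
from sklearn import datasets

diabetes = datasets.load_diabetes()
X_train = diabetes.data[:, np.newaxis, 2][:-20]   # shape (422, 1)
y_train = diabetes.target[:-20]

# Append an intercept column and solve the least-squares problem directly
A = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
beta, _, _, _ = np.linalg.lstsq(A, y_train, rcond=None)
print(beta)   # roughly [938.237, 152.919], i.e. slope and intercept
{code}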



> Python test failing for LinearRegCG
> -----------------------------------
>
>                 Key: SYSTEMML-1238
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1238
>             Project: SystemML
>          Issue Type: Bug
>          Components: Algorithms, APIs
>    Affects Versions: SystemML 0.13
>            Reporter: Imran Younus
>            Assignee: Niketan Pansare
>         Attachments: python_LinearReg_test_spark.1.6.log, 
> python_LinearReg_test_spark.2.1.log
>
>
> [~deron] discovered that one of the Python tests ({{test_mllearn_df.py}}) 
> with Spark 2.1.0 was failing because the test score from linear regression 
> was very low ({{~0.24}}). I did some investigation and it turns out that 
> the model parameters computed by the DML script are incorrect. In 
> SystemML 0.12, the values of the betas from the linear regression model are 
> {{\[152.919, 938.237\]}}, which is what we expect from the normal equation 
> (I also verified this with sklearn). But the betas from SystemML 0.13 
> (with Spark 2.1.0) come out to be {{\[153.146, 458.489\]}}. These are not 
> correct, and therefore the test score is much lower than expected. The data 
> going into the DML script is correct: I printed out the values of {{X}} and 
> {{Y}} in DML and didn't see any issue there.
> Attached are the log files for the two tests (SystemML 0.12 and 0.13), run 
> with the explain flag.


