Baunsgaard commented on a change in pull request #1071:
URL: https://github.com/apache/systemds/pull/1071#discussion_r496651094



##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+

Review comment:
       Please add documentation for the function. You can look at an existing 
builtin such as l2svm for inspiration.

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)

Review comment:
       Add a verbose flag to enable and disable printing.

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {
+    n = nrow(X)
+    m = ncol(X)
+
+
+    #params
+    tol = 10^(-15)
+    M = 5
+    tau = 1
+    maxiter = 1000
+
+    #constants

Review comment:
       Also, where appropriate, expose the constants as parameters if they are 
intended to be modified. (Constants usually are not, but please consider each 
case individually here.)

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression

Review comment:
       Just because I don't know: is `sparsa` an established term for the algorithm, or a typo?

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the 
given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big 
Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# 
---------------------------------------------------------------------------------------------
+# NAME         TYPE   DEFAULT  MEANING
+# 
---------------------------------------------------------------------------------------------
+# X            String ---      location to read the matrix X input matrix
+# k            Int    ---      indicates dimension of the new vector space 
constructed from eigen vectors
+# tolobj       Int    0.00001  objective function tolerance value to stop ppca 
algorithm
+# tolrecerr    Int    0.02     reconstruction error tolerance value to stop 
the algorithm
+# iter         Int    10       maximum number of iterations
+# fmt          String 'text'   output format of results PPCA such as "text" or 
"csv"

Review comment:
       remove fmt argument

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {

Review comment:
       double check the indentation.

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression
+
+
+m_lasso = function(Matrix[Double] X, Matrix[Double] y) return(Matrix[Double] w)
+    {
+    n = nrow(X)
+    m = ncol(X)
+
+
+    #params

Review comment:
       please add these as parameters to the algorithm

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the 
given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big 
Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# 
---------------------------------------------------------------------------------------------
+# NAME         TYPE   DEFAULT  MEANING
+# 
---------------------------------------------------------------------------------------------
+# X            String ---      location to read the matrix X input matrix
+# k            Int    ---      indicates dimension of the new vector space 
constructed from eigen vectors
+# tolobj       Int    0.00001  objective function tolerance value to stop ppca 
algorithm
+# tolrecerr    Int    0.02     reconstruction error tolerance value to stop 
the algorithm
+# iter         Int    10       maximum number of iterations
+# fmt          String 'text'   output format of results PPCA such as "text" or 
"csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C 
V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# 
---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# 
---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# 
---------------------------------------------------------------------------------------------
+# C            Matrix  ---     principal components
+# V            Matrix  ---     eigenvalues / eigenvalues of principal 
components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, 
double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, 
Matrix[Double] V)
+    {
+    k = ncol(X)
+    n = nrow(X);
+    m = ncol(X);
+
+    #initializing principal components matrix
+    C =  rand(rows=m, cols=k, pdf="normal");
+    ss = rand(rows=1, cols=1, pdf="normal");
+    ss = as.scalar(ss);
+    ssPrev = ss;
+
+    # best selected principle components - with the lowest reconstruction error
+    PC = C;
+
+    # initilizing reconstruction error
+    RE = tolrecerr+1;
+    REBest = RE;
+
+    Z = matrix(0,rows=1,cols=1);
+
+    #Objective function value
+    ObjRelChng = tolobj+1;
+
+    # mean centered input matrix - dim -> [n,m]
+    Xm = X - colMeans(X);
+
+    #I -> k x k
+    ITMP = matrix(1,rows=k,cols=1);
+    I = diag(ITMP);
+
+    i = 0;
+    while (i < iter & ObjRelChng > tolobj & RE > tolrecerr){
+        #Estimation step - Covariance matrix
+        #M -> k x k
+        M = t(C) %*% C + I*ss;
+
+        #Auxilary matrix with n latent variables
+        # Z -> n x k
+        Z = Xm %*% (C %*% inv(M));
+
+        #ZtZ -> k x k
+        ZtZ = t(Z) %*% Z + inv(M)*ss;
+
+        #XtZ -> m x k
+        XtZ = t(Xm) %*% Z;
+
+        #Maximization step
+        #C ->  m x k
+        ZtZ_sum = sum(ZtZ); #+n*inv(M));
+        C = XtZ/ZtZ_sum;
+
+        #ss2 -> 1 x 1
+        ss2 = trace(ZtZ * (t(C) %*% C));
+
+        #ss3 -> 1 x 1
+        ss3 = sum((Z %*% t(C)) %*% t(Xm));
+
+        #Frobenius norm of reconstruction error -> Euclidean norm
+        #Fn -> 1 x 1
+        Fn = sum(Xm*Xm);
+
+        #ss -> 1 x 1
+        ss = (Fn + ss2 - 2*ss3)/(n*m);
+
+       #calculating objective function relative change
+       ObjRelChng = abs(1 - ss/ssPrev);
+       #print("Objective Relative Change: " + ObjRelChng + ", Objective: " + 
ss);

Review comment:
       double check indentation

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the 
given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big 
Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# 
---------------------------------------------------------------------------------------------
+# NAME         TYPE   DEFAULT  MEANING
+# 
---------------------------------------------------------------------------------------------
+# X            String ---      location to read the matrix X input matrix
+# k            Int    ---      indicates dimension of the new vector space 
constructed from eigen vectors
+# tolobj       Int    0.00001  objective function tolerance value to stop ppca 
algorithm
+# tolrecerr    Int    0.02     reconstruction error tolerance value to stop 
the algorithm
+# iter         Int    10       maximum number of iterations
+# fmt          String 'text'   output format of results PPCA such as "text" or 
"csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C 
V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100

Review comment:
       Hadoop is no longer invoked like this, so please remove this line.

##########
File path: 
src/test/java/org/apache/sysds/test/functions/builtin/BuiltinLassoTest.java
##########
@@ -0,0 +1,55 @@
+package org.apache.sysds.test.functions.builtin;

Review comment:
       The license header is missing, which is why the tests on GitHub fail.
   
   Run `mvn package -P rat` and look at the file generated at `/target/rat.txt` 
to find these errors.

##########
File path: src/main/java/org/apache/sysds/common/Builtins.java
##########
@@ -152,6 +153,7 @@
        PCA("pca", true),
        PNMF("pnmf", true),
        PPRED("ppred", false),
+       PPCA("ppca", true),

Review comment:
       Move this entry one line up to keep the list in alphabetical order.
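   
   For reference, the intended order can be sketched with a cut-down stand-in 
for the enum. This is not the real `Builtins` class: the name 
`BuiltinsOrderSketch`, the nested `Builtin` enum, and the field names are 
assumptions for illustration, and the meaning of the boolean is not visible in 
this hunk, so the field name `script` is a guess.

```java
class BuiltinsOrderSketch {
    // Hypothetical stand-in: only the four entries visible in the hunk,
    // with PPCA moved above PPRED.
    enum Builtin {
        PCA("pca", true),
        PNMF("pnmf", true),
        PPCA("ppca", true),
        PPRED("ppred", false);

        final String fname;    // function name as written in the hunk
        final boolean script;  // assumed meaning; not shown in the hunk

        Builtin(String fname, boolean script) {
            this.fname = fname;
            this.script = script;
        }
    }

    // True if the entries are declared in ascending order of their constant names.
    static boolean isAlphabetical() {
        Builtin[] all = Builtin.values();
        for (int i = 1; i < all.length; i++)
            if (all[i - 1].name().compareTo(all[i].name()) > 0)
                return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAlphabetical()); // prints true
    }
}
```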

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the 
given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big 
Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# 
---------------------------------------------------------------------------------------------
+# NAME         TYPE   DEFAULT  MEANING
+# 
---------------------------------------------------------------------------------------------
+# X            String ---      location to read the matrix X input matrix
+# k            Int    ---      indicates dimension of the new vector space 
constructed from eigen vectors
+# tolobj       Int    0.00001  objective function tolerance value to stop ppca 
algorithm
+# tolrecerr    Int    0.02     reconstruction error tolerance value to stop 
the algorithm
+# iter         Int    10       maximum number of iterations
+# fmt          String 'text'   output format of results PPCA such as "text" or 
"csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C 
V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# 
---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# 
---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# 
---------------------------------------------------------------------------------------------
+# C            Matrix  ---     principal components
+# V            Matrix  ---     eigenvalues / eigenvalues of principal 
components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, 
double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, 
Matrix[Double] V)

Review comment:
       remove the fmt argument

##########
File path: 
src/test/java/org/apache/sysds/test/functions/builtin/BuiltinLassoTest.java
##########
@@ -0,0 +1,55 @@
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+
+public class BuiltinLassoTest extends AutomatedTestBase{
+
+    private final static String TEST_NAME = "lasso";
+    private final static String TEST_DIR = "functions/builtin/";
+    private final static String TEST_CLASS_DIR = TEST_DIR + 
BuiltinLassoTest.class.getSimpleName() + "/";
+
+    private final static int rows = 100;
+    private final static int cols = 10;
+
+    @Override
+    public void setUp(){
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, 
TEST_NAME, new String[]{"B"}));
+    }
+
+    @Test
+    public void testLasso(){ runLassoTest(); }
+
+
+    private void runLassoTest(){
+
+        loadTestConfiguration(getTestConfiguration(TEST_NAME));
+        String HOME = SCRIPT_DIR + TEST_DIR;
+        fullDMLScriptName = HOME + TEST_NAME + ".dml";
+        List<String> proArgs = new ArrayList<>();
+
+
+        proArgs.add("-explain");
+        proArgs.add("-stats");
+        proArgs.add("-args");
+        proArgs.add(input("X"));
+        proArgs.add(input("y"));
+        proArgs.add(output("w"));
+        programArgs = proArgs.toArray(new String[proArgs.size()]);
+        double[][] X = getRandomMatrix(rows, cols, 0, 1, 0.8, -1);
+        double[][] y = getRandomMatrix(rows, 1, 0, 1, 0.8, -1);
+        writeInputMatrixWithMTD("X", X, true);
+        writeInputMatrixWithMTD("y", y, true);
+
+
+        runTest(true, EXCEPTION_NOT_EXPECTED, null, -1);
+

Review comment:
       Please add logic to verify that the algorithm runs correctly.
   The current code only executes the algorithm to see if it crashes. We need 
the test to also verify that the algorithm outputs something reasonable.
   
   You can engineer your input so that the expected output is known in advance, 
based on the algorithm.
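   
   A minimal sketch of the kind of check I mean, written as standalone Java so 
that it does not depend on the SystemDS test utilities. Everything named here 
is an assumption for illustration: the class `LassoCheckSketch`, its helpers, 
the chosen weights and tolerance, and in particular the solver, which is a 
plain ISTA loop standing in for the builtin and not the code under review. The 
idea is to generate `y` from known sparse weights without noise, so that a 
correct lasso solver with a small penalty must return weights close to the 
known ones.

```java
import java.util.Random;

class LassoCheckSketch {

    // Soft-thresholding: the proximal operator of t * |v|.
    static double softThreshold(double v, double t) {
        return Math.signum(v) * Math.max(Math.abs(v) - t, 0.0);
    }

    // Stand-in solver (plain ISTA) for
    //   min_w 0.5 * ||X w - y||^2 + lambda * ||w||_1
    static double[] solveLasso(double[][] X, double[] y, double lambda, int maxIter) {
        int n = X.length, m = X[0].length;
        // The squared Frobenius norm of X bounds the largest eigenvalue of
        // X^T X, so 1 / lip is a safe (conservative) step size.
        double lip = 0.0;
        for (double[] row : X)
            for (double v : row)
                lip += v * v;
        double[] w = new double[m];
        for (int it = 0; it < maxIter; it++) {
            double[] grad = new double[m];
            for (int i = 0; i < n; i++) {
                double r = -y[i];                 // residual (X w - y)[i]
                for (int j = 0; j < m; j++)
                    r += X[i][j] * w[j];
                for (int j = 0; j < m; j++)
                    grad[j] += X[i][j] * r;       // accumulate X^T (X w - y)
            }
            for (int j = 0; j < m; j++)
                w[j] = softThreshold(w[j] - grad[j] / lip, lambda / lip);
        }
        return w;
    }

    // Engineered input: Gaussian entries with a fixed seed for repeatability.
    static double[][] makeX(int rows, int cols, long seed) {
        Random rnd = new Random(seed);
        double[][] X = new double[rows][cols];
        for (double[] row : X)
            for (int j = 0; j < row.length; j++)
                row[j] = rnd.nextGaussian();
        return X;
    }

    // Noise-free response y = X w.
    static double[] multiply(double[][] X, double[] w) {
        double[] y = new double[X.length];
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < w.length; j++)
                y[i] += X[i][j] * w[j];
        return y;
    }

    static double maxAbsDiff(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++)
            d = Math.max(d, Math.abs(a[i] - b[i]));
        return d;
    }

    public static void main(String[] args) {
        double[][] X = makeX(100, 10, 42L);
        double[] wTrue = {3.0, 0, 0, -2.0, 0, 0, 0, 1.5, 0, 0};
        double[] y = multiply(X, wTrue);
        double[] w = solveLasso(X, y, 0.01, 5000);
        double err = maxAbsDiff(w, wTrue);
        if (err >= 1e-2)
            throw new AssertionError("recovered weights deviate by " + err);
        System.out.println("recovered weights are within tolerance");
    }
}
```

   In the actual test, the weights would be read back from the script output 
`w` instead of being computed in Java, and the same comparison against the 
known weights would replace the bare "did not crash" check.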

##########
File path: scripts/builtin/lasso.dml
##########
@@ -0,0 +1,116 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#uses the sparsa algorithm to perform lasso regression

Review comment:
       If it is a term, I would need a link to some documentation, or more 
information, to be able to search for it efficiently.

##########
File path: src/test/scripts/functions/builtin/lasso.dml
##########
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1)
+y = read($2)
+w = lasso(X = X, y = y)
+write(w, $3)

Review comment:
       Add a newline at the end of the file to make GitHub happy.

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the 
given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big 
Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# 
---------------------------------------------------------------------------------------------
+# NAME         TYPE   DEFAULT  MEANING
+# 
---------------------------------------------------------------------------------------------
+# X            String ---      location to read the matrix X input matrix
+# k            Int    ---      indicates dimension of the new vector space 
constructed from eigen vectors
+# tolobj       Int    0.00001  objective function tolerance value to stop ppca 
algorithm
+# tolrecerr    Int    0.02     reconstruction error tolerance value to stop 
the algorithm
+# iter         Int    10       maximum number of iterations
+# fmt          String 'text'   output format of results PPCA such as "text" or 
"csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C 
V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# 
---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# 
---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# 
---------------------------------------------------------------------------------------------
+# C            Matrix  ---     principal components
+# V            Matrix  ---     eigenvalues / eigenvalues of principal 
components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, 
double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, 
Matrix[Double] V)
+    {
+    k = ncol(X)
+    n = nrow(X);
+    m = ncol(X);
+
+    #initializing principal components matrix
+    C =  rand(rows=m, cols=k, pdf="normal");
+    ss = rand(rows=1, cols=1, pdf="normal");
+    ss = as.scalar(ss);
+    ssPrev = ss;
+
+    # best selected principle components - with the lowest reconstruction error
+    PC = C;
+
+    # initilizing reconstruction error
+    RE = tolrecerr+1;
+    REBest = RE;
+
+    Z = matrix(0,rows=1,cols=1);
+
+    #Objective function value
+    ObjRelChng = tolobj+1;
+
+    # mean centered input matrix - dim -> [n,m]
+    Xm = X - colMeans(X);
+
+    #I -> k x k
+    ITMP = matrix(1,rows=k,cols=1);
+    I = diag(ITMP);
+
+    i = 0;
+    while (i < iter & ObjRelChng > tolobj & RE > tolrecerr){
+        #Estimation step - Covariance matrix
+        #M -> k x k
+        M = t(C) %*% C + I*ss;
+
+        #Auxilary matrix with n latent variables
+        # Z -> n x k
+        Z = Xm %*% (C %*% inv(M));
+
+        #ZtZ -> k x k
+        ZtZ = t(Z) %*% Z + inv(M)*ss;
+
+        #XtZ -> m x k
+        XtZ = t(Xm) %*% Z;
+
+        #Maximization step
+        #C ->  m x k
+        ZtZ_sum = sum(ZtZ); #+n*inv(M));
+        C = XtZ/ZtZ_sum;
+
+        #ss2 -> 1 x 1
+        ss2 = trace(ZtZ * (t(C) %*% C));
+
+        #ss3 -> 1 x 1
+        ss3 = sum((Z %*% t(C)) %*% t(Xm));
+
+        #Frobenius norm of reconstruction error -> Euclidean norm
+        #Fn -> 1 x 1
+        Fn = sum(Xm*Xm);
+
+        #ss -> 1 x 1
+        ss = (Fn + ss2 - 2*ss3)/(n*m);
+
+       #calculating objective function relative change
+       ObjRelChng = abs(1 - ss/ssPrev);
+       #print("Objective Relative Change: " + ObjRelChng + ", Objective: " + 
ss);
+
+        #Reconstruction error
+        R = ((Z %*% t(C)) -  Xm);
+
+        #calculate the error
+        #TODO rethink calculation of reconstruction error ....
+        #1-Norm of reconstruction error - a big dense matrix
+        #RE -> n x m
+        RE = abs(sum(R)/sum(Xm));
+        if (RE < REBest){
+            PC = C;
+            REBest = RE;
+        }
+        #print("ss: " + ss +" = Fn( "+ Fn +" ) + ss2( " + ss2  +" ) - 2*ss3( " 
+ ss3 + " ), Reconstruction Error: " + RE);
+
+        ssPrev = ss;
+        i = i+1;
+    }
+    print("Objective Relative Change: " + ObjRelChng);
+    print ("Number of iterations: " + i + ", Reconstruction Err: " + REBest);
+
+    # reconstructs data
+    # RD -> n x k
+    RD = X %*% PC;
+
+    # calculate eigenvalues - principle component variance
+    RDMean = colMeans(RD);
+    V = t(colMeans(RD*RD) - (RDMean*RDMean));
+
+    # sorting eigenvalues and eigenvectors in decreasing order
+    V_decr_idx = order(target=V,by=1,decreasing=TRUE,index.return=TRUE);
+    VF_decr = table(seq(1,nrow(V)),V_decr_idx);
+    V = VF_decr %*% V;
+    PC = PC %*% VF_decr;
+
+    # writing principal components
+    # write(PC, fileC, format=fmt0);
+    # writing eigen values/pc variance
+    # write(V, fileV, format=fmt0);
+    }

Review comment:
       Add a newline at the end of the file to make GitHub happy.

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME         TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X            String ---      location to read the matrix X input matrix
+# k            Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj       Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr    Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter         Int    10       maximum number of iterations
+# fmt          String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C            Matrix  ---     principal components
+# V            Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)
+    {
+    k = ncol(X)
+    n = nrow(X);
+    m = ncol(X);
+
+    #initializing principal components matrix
+    C =  rand(rows=m, cols=k, pdf="normal");
+    ss = rand(rows=1, cols=1, pdf="normal");
+    ss = as.scalar(ss);
+    ssPrev = ss;
+
+    # best selected principle components - with the lowest reconstruction error
+    PC = C;
+
+    # initilizing reconstruction error
+    RE = tolrecerr+1;
+    REBest = RE;
+
+    Z = matrix(0,rows=1,cols=1);
+
+    #Objective function value
+    ObjRelChng = tolobj+1;
+
+    # mean centered input matrix - dim -> [n,m]
+    Xm = X - colMeans(X);
+
+    #I -> k x k
+    ITMP = matrix(1,rows=k,cols=1);
+    I = diag(ITMP);
+
+    i = 0;
+    while (i < iter & ObjRelChng > tolobj & RE > tolrecerr){
+        #Estimation step - Covariance matrix
+        #M -> k x k
+        M = t(C) %*% C + I*ss;
+
+        #Auxilary matrix with n latent variables
+        # Z -> n x k
+        Z = Xm %*% (C %*% inv(M));
+
+        #ZtZ -> k x k
+        ZtZ = t(Z) %*% Z + inv(M)*ss;
+
+        #XtZ -> m x k
+        XtZ = t(Xm) %*% Z;
+
+        #Maximization step
+        #C ->  m x k
+        ZtZ_sum = sum(ZtZ); #+n*inv(M));
+        C = XtZ/ZtZ_sum;
+
+        #ss2 -> 1 x 1
+        ss2 = trace(ZtZ * (t(C) %*% C));
+
+        #ss3 -> 1 x 1
+        ss3 = sum((Z %*% t(C)) %*% t(Xm));
+
+        #Frobenius norm of reconstruction error -> Euclidean norm
+        #Fn -> 1 x 1
+        Fn = sum(Xm*Xm);
+
+        #ss -> 1 x 1
+        ss = (Fn + ss2 - 2*ss3)/(n*m);
+
+       #calculating objective function relative change
+       ObjRelChng = abs(1 - ss/ssPrev);
+       #print("Objective Relative Change: " + ObjRelChng + ", Objective: " + ss);
+
+        #Reconstruction error
+        R = ((Z %*% t(C)) -  Xm);
+
+        #calculate the error
+        #TODO rethink calculation of reconstruction error ....
+        #1-Norm of reconstruction error - a big dense matrix
+        #RE -> n x m
+        RE = abs(sum(R)/sum(Xm));
+        if (RE < REBest){
+            PC = C;
+            REBest = RE;
+        }
+        #print("ss: " + ss +" = Fn( "+ Fn +" ) + ss2( " + ss2  +" ) - 2*ss3( " + ss3 + " ), Reconstruction Error: " + RE);
+
+        ssPrev = ss;
+        i = i+1;
+    }
+    print("Objective Relative Change: " + ObjRelChng);
+    print ("Number of iterations: " + i + ", Reconstruction Err: " + REBest);
+
+    # reconstructs data
+    # RD -> n x k
+    RD = X %*% PC;
+
+    # calculate eigenvalues - principle component variance
+    RDMean = colMeans(RD);
+    V = t(colMeans(RD*RD) - (RDMean*RDMean));
+
+    # sorting eigenvalues and eigenvectors in decreasing order
+    V_decr_idx = order(target=V,by=1,decreasing=TRUE,index.return=TRUE);
+    VF_decr = table(seq(1,nrow(V)),V_decr_idx);
+    V = VF_decr %*% V;
+    PC = PC %*% VF_decr;
+
+    # writing principal components
+    # write(PC, fileC, format=fmt0);
+    # writing eigen values/pc variance
+    # write(V, fileV, format=fmt0);

Review comment:
       Remove these commented-out write statements.

##########
File path: src/test/scripts/functions/builtin/PPCA.dml
##########
@@ -0,0 +1,25 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1)
+PC = ppca(X = X)
+write(PC, V)
+

Review comment:
       PPCA returns two variables.
   Currently, when you try to write a multi-return result like this, we unfortunately do not throw an error; the write statement is simply ignored.
   Therefore this code will run fine but ultimately be buggy.
   Please assign both results from PPCA and write both out.
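
   A minimal sketch of what the test script could look like. It assumes the builtin is registered as `ppca` (as the call in the script above suggests) and that the Java test passes the input path as `$1` and the two output paths as `$2` and `$3`; adjust if the argument order differs.

   ```dml
   # read the input matrix passed as the first positional argument
   X = read($1)

   # ppca returns two matrices, so both must be bound on the left-hand side
   [PC, V] = ppca(X = X)

   # write each result to its own output location
   write(PC, $2)   # principal components
   write(V, $3)    # eigenvalues / variance of the components
   ```

   With both results bound explicitly, each `write` has a single matrix and a single target, so nothing is silently dropped.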

##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinPPCATest.java
##########
@@ -0,0 +1,49 @@
+package org.apache.sysds.test.functions.builtin;

Review comment:
       Add the license header to this file.

##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinPPCATest.java
##########
@@ -0,0 +1,49 @@
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class BuiltinPPCATest extends AutomatedTestBase {
+    private final static String TEST_NAME = "PPCA";
+    private final static String TEST_DIR = "functions/builtin/";
+    private final static String TEST_CLASS_DIR = TEST_DIR + BuiltinPPCATest.class.getSimpleName() + "/";
+
+    private final static int rows = 100;
+    private final static int cols = 10;
+
+    @Override
+    public void setUp(){
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"B"}));
+    }
+
+    @Test
+    public void testPPCA(){ runPPCATest(); }
+
+
+    private void runPPCATest(){
+
+        loadTestConfiguration(getTestConfiguration(TEST_NAME));
+        String HOME = SCRIPT_DIR + TEST_DIR;
+        fullDMLScriptName = HOME + TEST_NAME + ".dml";
+        List<String> proArgs = new ArrayList<>();
+
+        proArgs.add("-explain");
+        proArgs.add("-stats");
+        proArgs.add("-args");
+        proArgs.add(input("X"));
+        proArgs.add(output("PC"));
+        proArgs.add(output("V"));
+        programArgs = proArgs.toArray(new String[proArgs.size()]);
+        double[][] X = getRandomMatrix(rows, cols, 0, 1, 0.8, -1);
+        writeInputMatrixWithMTD("X", X, true);
+
+
+        runTest(true, EXCEPTION_NOT_EXPECTED, null, -1);
+

Review comment:
       Same points as in the other test.

##########
File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinLassoTest.java
##########
@@ -0,0 +1,55 @@
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+
+public class BuiltinLassoTest extends AutomatedTestBase{
+
+    private final static String TEST_NAME = "lasso";
+    private final static String TEST_DIR = "functions/builtin/";
+    private final static String TEST_CLASS_DIR = TEST_DIR + BuiltinLassoTest.class.getSimpleName() + "/";
+
+    private final static int rows = 100;
+    private final static int cols = 10;
+
+    @Override
+    public void setUp(){
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"B"}));
+    }
+
+    @Test
+    public void testLasso(){ runLassoTest(); }
+
+
+    private void runLassoTest(){
+
+        loadTestConfiguration(getTestConfiguration(TEST_NAME));
+        String HOME = SCRIPT_DIR + TEST_DIR;
+        fullDMLScriptName = HOME + TEST_NAME + ".dml";
+        List<String> proArgs = new ArrayList<>();
+
+
+        proArgs.add("-explain");
+        proArgs.add("-stats");

Review comment:
       Please remove -explain and -stats once you are done checking the algorithm and upgrading the test.

##########
File path: scripts/builtin/ppca.dml
##########
@@ -0,0 +1,154 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# This script performs Probabilistic Principal Component Analysis (PCA) on the given input data.
+# It is based on paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed
+# Platforms. Tarek Elgamal et.al.
+
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME         TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X            String ---      location to read the matrix X input matrix
+# k            Int    ---      indicates dimension of the new vector space constructed from eigen vectors
+# tolobj       Int    0.00001  objective function tolerance value to stop ppca algorithm
+# tolrecerr    Int    0.02     reconstruction error tolerance value to stop the algorithm
+# iter         Int    10       maximum number of iterations
+# fmt          String 'text'   output format of results PPCA such as "text" or "csv"
+# hadoop jar SystemDS.jar -f PPCA.dml -nvargs X=/INPUT_DIR/X  C=/OUTPUT_DIR/C V=/OUTPUT_DIR/V k=2 tol=0.2 iter=100
+# ---------------------------------------------------------------------------------------------
+# OUTPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME   TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# C            Matrix  ---     principal components
+# V            Matrix  ---     eigenvalues / eigenvalues of principal components
+#
+
+
+m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001, double tolrecerr = 0.02, String fmt0 = "text") return(Matrix[Double] PC, Matrix[Double] V)

Review comment:
       Move the return clause to a new line.
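
   One possible layout, shown only as a formatting sketch. The name, parameters, and types are copied from the diff above; the body is elided here and would stay as it is in the PR, so this fragment is not meant to run on its own.

   ```dml
   m_pcaa = function(Matrix[Double] X, int iter = 10, double tolobj = 0.00001,
       double tolrecerr = 0.02, String fmt0 = "text")
     return (Matrix[Double] PC, Matrix[Double] V)
   {
     # function body unchanged from the diff above (elided in this sketch)
   }
   ```

   Putting the return clause on its own line keeps the inputs and the outputs visually separate, which helps when a function returns more than one matrix.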




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

