Baunsgaard commented on a change in pull request #1145: URL: https://github.com/apache/systemds/pull/1145#discussion_r554435608
########## File path: scripts/builtin/decisionTree.dml ########## @@ -0,0 +1,518 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +# +# THIS SCRIPT IMPLEMENTS CLASSIFICATION TREES WITH BOTH SCALE AND CATEGORICAL FEATURES +# +# INPUT PARAMETERS: +# --------------------------------------------------------------------------------------------- +# NAME TYPE DEFAULT MEANING +# --------------------------------------------------------------------------------------------- +# X String --- Location to read feature matrix X; note that X needs to be both recoded and dummy coded +# Y String --- Location to read label matrix Y; note that Y needs to be both recoded and dummy coded +# R String " " Location to read the matrix R which for each feature in X contains the following information Review comment: formatting issues with tabs and spaces. 
########## File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDecisionTreeTest.java ########## @@ -0,0 +1,82 @@ +package org.apache.sysds.test.functions.builtin; Review comment: add license ########## File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDecisionTreeTest.java ########## @@ -0,0 +1,82 @@ +package org.apache.sysds.test.functions.builtin; + +import org.apache.sysds.common.Types; +import org.apache.sysds.lops.LopProperties; +import org.apache.sysds.test.AutomatedTestBase; +import org.apache.sysds.test.TestConfiguration; +import org.junit.Test; + +public class BuiltinDecisionTreeTest extends AutomatedTestBase +{ + private final static String TEST_NAME = "decisionTree"; + private final static String TEST_DIR = "functions/builtin/"; + private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinDecisionTreeTest.class.getSimpleName() + "/"; + + private final static double eps = 1e-10; + private final static int rows = 6; + private final static int cols = 4; + + @Override + public void setUp() { + addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"C"})); + } + + @Test + public void testDecisionTreeDefaultCP() { runDecisionTree(true, LopProperties.ExecType.CP); } + + @Test + public void testDecisionTreeSP() { + runDecisionTree(true, LopProperties.ExecType.SPARK); + } + + private void runDecisionTree(boolean defaultProb, LopProperties.ExecType instType) + { + Types.ExecMode platformOld = setExecMode(instType); + + try + { + loadTestConfiguration(getTestConfiguration(TEST_NAME)); + + String HOME = SCRIPT_DIR + TEST_DIR; + fullDMLScriptName = HOME + TEST_NAME + ".dml"; + programArgs = new String[]{"-args", input("X"), input("Y"), input("R"), output("M") }; + fullRScriptName = HOME + TEST_NAME + ".R"; + rCmd = "Rscript" + " " + fullRScriptName + " " + inputDir() + " " + expectedDir(); + + double[][] Y = getRandomMatrix(rows, 1, 0, 1, 1.0, 3); + for (int row = 0; row < rows; 
row++) { + Y[row][0] = (Y[row][0] > 0.5)? 1.0 : 0.0; + } + + //generate actual dataset + double[][] X = getRandomMatrix(rows, cols, 0, 100, 1.0, 7); + for (int row = 0; row < rows/2; row++) { + X[row][2] = (Y[row][0] > 0.5)? 2.0 : 1.0; + X[row][3] = 1.0; + } + for (int row = rows/2; row < rows; row++) { + X[row][2] = 1.0; + X[row][3] = (Y[row][0] > 0.5)? 2.0 : 1.0; + } + writeInputMatrixWithMTD("X", X, true); + writeInputMatrixWithMTD("Y", Y, true); + + + + double[][] R = getRandomMatrix(1, cols, 1, 1, 1.0, 1); + R[0][3] = 3.0; + R[0][2] = 3.0; + writeInputMatrixWithMTD("R", R, true); + + runTest(true, false, null, -1); + +// runRScript(true); +// HashMap<MatrixValue.CellIndex, Double> dmlfile = readDMLMatrixFromOutputDir("C"); +// HashMap<MatrixValue.CellIndex, Double> rfile = readRMatrixFromExpectedDir("C"); +// TestUtils.compareMatrices(dmlfile, rfile, eps, "Stat-DML", "Stat-R"); + } + finally { + rtplatform = platformOld; + } + } +} Review comment: and add a newline at the end of the files. 
########## File path: src/test/scripts/functions/builtin/decisionTree.R ########## @@ -0,0 +1,33 @@ +# Title : TODO +# Objective : TODO +# Created by: gaisberger +# Created on: 27.11.20 Review comment: remove these first few lines before the license ########## File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDecisionTreeTest.java ########## @@ -0,0 +1,82 @@ +package org.apache.sysds.test.functions.builtin; + +import org.apache.sysds.common.Types; +import org.apache.sysds.lops.LopProperties; +import org.apache.sysds.test.AutomatedTestBase; +import org.apache.sysds.test.TestConfiguration; +import org.junit.Test; + +public class BuiltinDecisionTreeTest extends AutomatedTestBase +{ + private final static String TEST_NAME = "decisionTree"; + private final static String TEST_DIR = "functions/builtin/"; + private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinDecisionTreeTest.class.getSimpleName() + "/"; + + private final static double eps = 1e-10; + private final static int rows = 6; + private final static int cols = 4; + + @Override + public void setUp() { + addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"C"})); + } + + @Test + public void testDecisionTreeDefaultCP() { runDecisionTree(true, LopProperties.ExecType.CP); } + + @Test + public void testDecisionTreeSP() { + runDecisionTree(true, LopProperties.ExecType.SPARK); + } + + private void runDecisionTree(boolean defaultProb, LopProperties.ExecType instType) + { + Types.ExecMode platformOld = setExecMode(instType); + + try + { + loadTestConfiguration(getTestConfiguration(TEST_NAME)); + + String HOME = SCRIPT_DIR + TEST_DIR; + fullDMLScriptName = HOME + TEST_NAME + ".dml"; + programArgs = new String[]{"-args", input("X"), input("Y"), input("R"), output("M") }; + fullRScriptName = HOME + TEST_NAME + ".R"; + rCmd = "Rscript" + " " + fullRScriptName + " " + inputDir() + " " + expectedDir(); + + double[][] Y = getRandomMatrix(rows, 1, 0, 1, 1.0, 
3); + for (int row = 0; row < rows; row++) { + Y[row][0] = (Y[row][0] > 0.5)? 1.0 : 0.0; + } + + //generate actual dataset + double[][] X = getRandomMatrix(rows, cols, 0, 100, 1.0, 7); + for (int row = 0; row < rows/2; row++) { + X[row][2] = (Y[row][0] > 0.5)? 2.0 : 1.0; + X[row][3] = 1.0; + } + for (int row = rows/2; row < rows; row++) { + X[row][2] = 1.0; + X[row][3] = (Y[row][0] > 0.5)? 2.0 : 1.0; + } + writeInputMatrixWithMTD("X", X, true); + writeInputMatrixWithMTD("Y", Y, true); + + + + double[][] R = getRandomMatrix(1, cols, 1, 1, 1.0, 1); + R[0][3] = 3.0; + R[0][2] = 3.0; + writeInputMatrixWithMTD("R", R, true); + + runTest(true, false, null, -1); Review comment: alternatively you can write an equivalent R script, to verify you would get the same results. But it is preferable if you could verify that the results are "correct" not the "same". ########## File path: scripts/builtin/decisionTree.dml ########## @@ -0,0 +1,518 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# +#------------------------------------------------------------- + +# +# THIS SCRIPT IMPLEMENTS CLASSIFICATION TREES WITH BOTH SCALE AND CATEGORICAL FEATURES +# +# INPUT PARAMETERS: +# --------------------------------------------------------------------------------------------- +# NAME TYPE DEFAULT MEANING +# --------------------------------------------------------------------------------------------- +# X String --- Location to read feature matrix X; note that X needs to be both recoded and dummy coded +# Y String --- Location to read label matrix Y; note that Y needs to be both recoded and dummy coded +# R String " " Location to read the matrix R which for each feature in X contains the following information +# - R[1,]: Row Vector which indicates if feature vector is scalar or categorical. 1 indicates +# a scalar feature vector, other positive Integers indicate the number of categories +# If R is not provided by default all variables are assumed to be scale +# bins Int 20 Number of equiheight bins per scale feature to choose thresholds +# depth Int 25 Maximum depth of the learned tree +# M String --- Location to write matrix M containing the learned tree +# --------------------------------------------------------------------------------------------- +# OUTPUT: +# Matrix M where each column corresponds to a node in the learned tree and each row contains the following information: +# M[1,j]: id of node j (in a complete binary tree) +# M[2,j]: Offset (no. 
of columns) to left child of j if j is an internal node, otherwise 0 +# M[3,j]: Feature index of the feature (scale feature id if the feature is scale or categorical feature id if the feature is categorical) +# that node j looks at if j is an internal node, otherwise 0 +# M[4,j]: Type of the feature that node j looks at if j is an internal node: holds the same information as R input vector +# M[5,j]: If j is an internal node: 1 if the feature chosen for j is scale, otherwise the size of the subset of values +# stored in rows 6,7,... if j is categorical +# If j is a leaf node: number of misclassified samples reaching at node j +# M[6:,j]: If j is an internal node: Threshold the example's feature value is compared to is stored at M[6,j] if the feature chosen for j is scale, +# otherwise if the feature chosen for j is categorical rows 6,7,... depict the value subset chosen for j +# If j is a leaf node 1 if j is impure and the number of samples at j > threshold, otherwise 0 +# ------------------------------------------------------------------------------------------- +# HOW TO INVOKE THIS SCRIPT - EXAMPLE: +# hadoop jar SystemDS.jar -f decision-tree.dml -nvargs X=INPUT_DIR/X Y=INPUT_DIR/Y R=INPUT_DIR/R M=OUTPUT_DIR/model +# bins=20 depth=25 num_leaf=10 num_samples=3000 impurity=Gini fmt=csv Review comment: remove the hadoop example ########## File path: scripts/builtin/decisionTree.dml ########## @@ -0,0 +1,518 @@ +#------------------------------------------------------------- +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +#------------------------------------------------------------- + +# +# THIS SCRIPT IMPLEMENTS CLASSIFICATION TREES WITH BOTH SCALE AND CATEGORICAL FEATURES +# +# INPUT PARAMETERS: +# --------------------------------------------------------------------------------------------- +# NAME TYPE DEFAULT MEANING +# --------------------------------------------------------------------------------------------- +# X String --- Location to read feature matrix X; note that X needs to be both recoded and dummy coded +# Y String --- Location to read label matrix Y; note that Y needs to be both recoded and dummy coded +# R String " " Location to read the matrix R which for each feature in X contains the following information +# - R[1,]: Row Vector which indicates if feature vector is scalar or categorical. 1 indicates +# a scalar feature vector, other positive Integers indicate the number of categories +# If R is not provided by default all variables are assumed to be scale +# bins Int 20 Number of equiheight bins per scale feature to choose thresholds +# depth Int 25 Maximum depth of the learned tree +# M String --- Location to write matrix M containing the learned tree +# --------------------------------------------------------------------------------------------- +# OUTPUT: +# Matrix M where each column corresponds to a node in the learned tree and each row contains the following information: +# M[1,j]: id of node j (in a complete binary tree) +# M[2,j]: Offset (no. 
of columns) to left child of j if j is an internal node, otherwise 0 +# M[3,j]: Feature index of the feature (scale feature id if the feature is scale or categorical feature id if the feature is categorical) +# that node j looks at if j is an internal node, otherwise 0 +# M[4,j]: Type of the feature that node j looks at if j is an internal node: holds the same information as R input vector +# M[5,j]: If j is an internal node: 1 if the feature chosen for j is scale, otherwise the size of the subset of values +# stored in rows 6,7,... if j is categorical +# If j is a leaf node: number of misclassified samples reaching at node j +# M[6:,j]: If j is an internal node: Threshold the example's feature value is compared to is stored at M[6,j] if the feature chosen for j is scale, +# otherwise if the feature chosen for j is categorical rows 6,7,... depict the value subset chosen for j +# If j is a leaf node 1 if j is impure and the number of samples at j > threshold, otherwise 0 +# ------------------------------------------------------------------------------------------- +# HOW TO INVOKE THIS SCRIPT - EXAMPLE: +# hadoop jar SystemDS.jar -f decision-tree.dml -nvargs X=INPUT_DIR/X Y=INPUT_DIR/Y R=INPUT_DIR/R M=OUTPUT_DIR/model +# bins=20 depth=25 num_leaf=10 num_samples=3000 impurity=Gini fmt=csv + + + +# ---------------------------------------------------------------------------------------------------------------------- +# Pseudo Code: +# All ignoring NULL Features/COLUMNS/ROWS +# calcImpurity(frame = Frame[], col, labels = Array[]) +# returns impurity: Scale and splittingCriteria: Scalar or List{FeatureClassIndices} +# calcBestSplittingCriteria(frame = Frame[], labels = Array[]) +# runs through all features in frame and calculates the impurity +# returns column with the best (lowest) Impurity, and the splittingCriteria +# splitData(frame = Frame, splittingCriteria: SplittingCriteria) +# returns FalseFrame and TrueFrame according to the Splitting Criteria (to keep the 
indices true fill unwanted Data with NULL) +# calcLeftNode(i: Int) = i * 2 +# returns it left NodeInBinTree (for Example: calcLeftNode(1) = 2, calcLeftNode(2) = 4, calcLeftNode(3) = 6) +# ------------------------- +# inputData = read(X) +# labels = read(Y) +# inputLabels = (R == " ")? USE_SCALAR : read(R) USE_IT_TO_DETERMINE_IF_FEATURE_IS_SCALE_OR_LABELED +# Review comment: Move the m_decisionTree(...) returns(...) to be the first function. Since this is the entry point of the algorithm and what we call from the system, it should be the first thing you read in the file. Then afterwards I would remove or relocate the commented code here into the individual functions, and add comments to the individual functions instead. ########## File path: src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDecisionTreeTest.java ########## @@ -0,0 +1,82 @@ +package org.apache.sysds.test.functions.builtin; + +import org.apache.sysds.common.Types; +import org.apache.sysds.lops.LopProperties; +import org.apache.sysds.test.AutomatedTestBase; +import org.apache.sysds.test.TestConfiguration; +import org.junit.Test; + +public class BuiltinDecisionTreeTest extends AutomatedTestBase +{ + private final static String TEST_NAME = "decisionTree"; + private final static String TEST_DIR = "functions/builtin/"; + private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinDecisionTreeTest.class.getSimpleName() + "/"; + + private final static double eps = 1e-10; + private final static int rows = 6; + private final static int cols = 4; + + @Override + public void setUp() { + addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[]{"C"})); + } + + @Test + public void testDecisionTreeDefaultCP() { runDecisionTree(true, LopProperties.ExecType.CP); } + + @Test + public void testDecisionTreeSP() { + runDecisionTree(true, LopProperties.ExecType.SPARK); + } + + private void runDecisionTree(boolean defaultProb, LopProperties.ExecType instType) + { + 
Types.ExecMode platformOld = setExecMode(instType); + + try + { + loadTestConfiguration(getTestConfiguration(TEST_NAME)); + + String HOME = SCRIPT_DIR + TEST_DIR; + fullDMLScriptName = HOME + TEST_NAME + ".dml"; + programArgs = new String[]{"-args", input("X"), input("Y"), input("R"), output("M") }; + fullRScriptName = HOME + TEST_NAME + ".R"; + rCmd = "Rscript" + " " + fullRScriptName + " " + inputDir() + " " + expectedDir(); + + double[][] Y = getRandomMatrix(rows, 1, 0, 1, 1.0, 3); + for (int row = 0; row < rows; row++) { + Y[row][0] = (Y[row][0] > 0.5)? 1.0 : 0.0; + } + + //generate actual dataset + double[][] X = getRandomMatrix(rows, cols, 0, 100, 1.0, 7); + for (int row = 0; row < rows/2; row++) { + X[row][2] = (Y[row][0] > 0.5)? 2.0 : 1.0; + X[row][3] = 1.0; + } + for (int row = rows/2; row < rows; row++) { + X[row][2] = 1.0; + X[row][3] = (Y[row][0] > 0.5)? 2.0 : 1.0; + } + writeInputMatrixWithMTD("X", X, true); + writeInputMatrixWithMTD("Y", Y, true); + + + + double[][] R = getRandomMatrix(1, cols, 1, 1, 1.0, 1); + R[0][3] = 3.0; + R[0][2] = 3.0; + writeInputMatrixWithMTD("R", R, true); + + runTest(true, false, null, -1); Review comment: it is not enough to run the script, because even if it fails it will still pass the test. You need to verify the output of the algorithm somehow; there are two options: 1. Write the result matrix from the script, and then here read it in and verify the outputs. 2. Print the result matrix from the script and use the std out capture returned from runTest, to verify the correct values are printed. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org