This is an automated email from the ASF dual-hosted git repository.

mboehm7 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/systemds.git


The following commit(s) were added to refs/heads/main by this push:
     new a73d54a0bb [SYSTEMDS-3277] Rework and cleanup decisionTree and decisionTreePredict
a73d54a0bb is described below

commit a73d54a0bbf9a3267a382a2354dae4e38ef5cc1e
Author: Matthias Boehm <[email protected]>
AuthorDate: Fri Mar 24 22:27:57 2023 +0100

    [SYSTEMDS-3277] Rework and cleanup decisionTree and decisionTreePredict
    
    This patch is a major overhaul of the existing decisionTree and
    decisionTreePredict built-in functions: it vectorizes the internals
    of decisionTree and updates the related predict functions accordingly.
    Furthermore, it includes various fixes so that the training and predict
    functions are in sync and can actually be used in practice again.
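    
    For illustration, a minimal usage sketch of the reworked API (not part
    of the patch; file names and the read/ctypes setup are assumptions),
    assuming X and y are already recoded/binned to contiguous positive
    integers:
    
      X = read("X.csv");   # features, recoded/binned (assumed input)
      y = read("y.csv");   # labels, recoded/binned (assumed input)
      # all features scale/ordinal (1), categorical label (2) as last entry
      ctypes = cbind(matrix(1, rows=1, cols=ncol(X)), as.matrix(2));
      M = decisionTree(X=X, y=y, ctypes=ctypes, max_depth=10, verbose=TRUE);
      yhat = decisionTreePredict(X=X, ctypes=ctypes, M=M, strategy="TT");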
---
 scripts/builtin/decisionTree.dml                   | 575 ++++++---------------
 scripts/builtin/decisionTreePredict.dml            | 100 ++--
 scripts/builtin/randomForest.dml                   |  11 +-
 scripts/builtin/randomForestPredict.dml            |   4 +-
 .../part1/BuiltinDecisionTreePredictTest.java      |  27 +-
 .../builtin/part1/BuiltinDecisionTreeTest.java     |  30 +-
 .../scripts/functions/builtin/decisionTree.dml     |   7 +-
 .../functions/builtin/decisionTreePredict.dml      |   3 +-
 .../scripts/functions/builtin/randomForestTest.dml |   2 +-
 9 files changed, 268 insertions(+), 491 deletions(-)

diff --git a/scripts/builtin/decisionTree.dml b/scripts/builtin/decisionTree.dml
index fc436545c3..9b5b2c0093 100644
--- a/scripts/builtin/decisionTree.dml
+++ b/scripts/builtin/decisionTree.dml
@@ -19,424 +19,193 @@
 #
 #-------------------------------------------------------------
 
-# Builtin script implementing classification trees with scale and categorical features
+# This script implements decision trees for recoded and binned categorical and
+# numerical input features. We train a single CART (classification and
+# regression tree); depending on the provided labels y, it performs either
+# classification (majority vote per leaf) or regression (average per leaf).
 #
 # INPUT:
-# ----------------------------------------------------------------------------------------------
-# X         Feature matrix X; note that X needs to be both recoded and dummy coded
-# Y         Label matrix Y; note that Y needs to be both recoded and dummy coded
-# R         Matrix R which for each feature in X contains the following information
-#           - R[1,]: Row Vector which indicates if feature vector is scalar or categorical. 1 indicates
-#           a scalar feature vector, other positive Integers indicate the number of categories
-#           If R is not provided by default all variables are assumed to be scale
-# bins      Number of equiheight bins per scale feature to choose thresholds
-# depth     Maximum depth of the learned tree
-# verbose   boolean specifying if the algorithm should print information while executing
-# ----------------------------------------------------------------------------------------------
+# ------------------------------------------------------------------------------
+# X               Feature matrix in recoded/binned representation
+# y               Label matrix in recoded/binned representation
+# ctypes          Row-Vector of column types [1 scale/ordinal, 2 categorical]
+#                 of shape 1-by-(ncol(X)+1), where the last entry is the y type
+# max_depth       Maximum depth of the learned tree (stopping criterion)
+# min_leaf        Minimum number of samples in leaf nodes (stopping criterion)
+# min_split       Minimum number of samples in leaf for attempting a split
+# max_features    Parameter controlling the number of features used as split
+#                 candidates at tree nodes: m = ceil(num_features^max_features)
+# impurity        Impurity measure: entropy, gini (default)
+# seed            Fixed seed for randomization of samples and split candidates
+# verbose         Flag indicating verbose debug output
+# ------------------------------------------------------------------------------
 #
 # OUTPUT:
-# -------------------------------------------------------------------------------------------
-# M      Matrix M where each column corresponds to a node in the learned tree and each row
-#        contains the following information:
-#        M[1,j]: id of node j (in a complete binary tree)
-#        M[2,j]: Offset (no. of columns) to left child of j if j is an internal node, otherwise 0
-#        M[3,j]: Feature index of the feature (scale feature id if the feature is scale or
-#        categorical feature id if the feature is categorical)
-#        that node j looks at if j is an internal node, otherwise 0
-#        M[4,j]: Type of the feature that node j looks at if j is an internal node: holds
-#        the same information as R input vector
-#        M[5,j]: If j is an internal node: 1 if the feature chosen for j is scale,
-#        otherwise the size of the subset of values
-#        stored in rows 6,7,... if j is categorical
-#        If j is a leaf node: number of misclassified samples reaching at node j
-#        M[6:,j]: If j is an internal node: Threshold the example's feature value is compared
-#        to is stored at M[6,j] if the feature chosen for j is scale,
-#        otherwise if the feature chosen for j is categorical rows 6,7,... depict the value subset chosen for j
-#        If j is a leaf node 1 if j is impure and the number of samples at j > threshold, otherwise 0
-# -------------------------------------------------------------------------------------------
-
-m_decisionTree = function(
-  Matrix[Double] X,
-  Matrix[Double] Y,
-  Matrix[Double] R,
-  Integer bins = 10,
-  Integer depth = 20,
-  Boolean verbose = FALSE
-) return (Matrix[Double] M) {
-  if (verbose) {
-    print("Executing Decision Tree:")
-  }
-  node_queue = matrix(1, rows=1, cols=1)     # Add first Node
-  impurity_queue = matrix(1, rows=1, cols=1)
-  use_cols_queue = matrix(1, rows=ncol(X), cols=1)   # Add fist bool Vector with all cols <=> (use all cols)
-  use_rows_queue = matrix(1, rows=nrow(X), cols=1)   # Add fist bool Vector with all rows <=> (use all rows)
-  queue_length = 1
-  M = matrix(0, rows = 0, cols = 0)
-  while (queue_length > 0) {
-    [node_queue, node] = dataQueuePop(node_queue)
-    [use_rows_queue, use_rows_vector] = dataQueuePop(use_rows_queue)
-    [use_cols_queue, use_cols_vector] = dataQueuePop(use_cols_queue)
-
-    available_rows = calcAvailable(use_rows_vector)
-    available_cols = calcAvailable(use_cols_vector)
-    [impurity_queue, parent_impurity] = dataQueuePop(impurity_queue)
-    create_child_nodes_flag = FALSE
-      if (verbose) {
-        print("Popped Node:  " + as.scalar(node))
-        print("Rows:     " + toString(t(use_rows_vector)))
-        print("Cols:     " + toString(t(use_cols_vector)))
-        print("Available Rows:   " + available_rows)
-        print("Available Cols:   " + available_cols)
-        print("Parent impurity:  " + as.scalar(parent_impurity))
-      }
-  
-    node_depth = calculateNodeDepth(node)
-    used_col = 0.0
-    if (node_depth < depth & available_rows > 1 & available_cols > 0 & as.scalar(parent_impurity) > 0) {
-      [impurity, used_col, threshold, type] = calcBestSplittingCriteria(X, Y, R, use_rows_vector, use_cols_vector, bins)
-      create_child_nodes_flag = impurity < as.scalar(parent_impurity)
-        if (verbose) {
-      print("Current impurity:   " + impurity)
-      print("Current threshold:  "+ toString(t(threshold)))
-        }
-    }
-    if (verbose) {
-
-     print("Current column:   " + used_col)
-     print("Current type:   " + type)
-    }
-    if (create_child_nodes_flag) {
-      [left, right] = calculateChildNodes(node)
-      node_queue = dataQueuePush(left, right, node_queue)
-
-      [new_use_cols_vector, left_use_rows_vector, right_use_rows_vector] = splitData(X, use_rows_vector, use_cols_vector, used_col, threshold, type)
-      use_rows_queue = dataQueuePush(left_use_rows_vector, right_use_rows_vector, use_rows_queue)
-      use_cols_queue = dataQueuePush(new_use_cols_vector, new_use_cols_vector, use_cols_queue)
-
-      impurity_queue = dataQueuePush(matrix(impurity, rows = 1, cols = 1), matrix(impurity, rows = 1, cols = 1), impurity_queue)
-      offset = dataQueueLength(node_queue) - 1
-      M = outputMatrixBind(M, node, offset, used_col, R, threshold)
-    } else {
-      M = outputMatrixBind(M, node, 0.0, used_col, R, matrix(0, rows = 1, cols = 1))
-    }
-    queue_length = dataQueueLength(node_queue)# -- user-defined function calls not supported in relational expressions
-
-    if (verbose) {
-      print("New QueueLen:   " + queue_length)
-      print("")
-    }
-  }
-}
-
-dataQueueLength = function(Matrix[Double] queue)  return (Double len) {
-  len = ncol(queue)
-}
-
-dataQueuePop = function(Matrix[Double] queue)  return (Matrix[Double] new_queue, Matrix[Double] node) {
-  node = matrix(queue[,1], rows=1, cols=nrow(queue))   # reshape to force the creation of a new object
-  node = matrix(node, rows=nrow(queue), cols=1)    # reshape to force the creation of a new object
-  len = dataQueueLength(queue)
-  if (len < 2) {
-    new_queue = matrix(0,0,0)
-  } else {
-    new_queue = matrix(queue[,2:ncol(queue)], rows=nrow(queue), cols=ncol(queue)-1)
-  }
-}
-
-dataQueuePush = function(Matrix[Double] left, Matrix[Double] right, Matrix[Double] queue)  return (Matrix[Double] new_queue) {
-  len = dataQueueLength(queue)
-  if(len <= 0) {
-    new_queue = cbind(left, right)
-  } else {
-    new_queue = cbind(queue, left, right)
-  }
-}
-
-dataVectorLength = function(Matrix[Double] vector)  return (Double len) {
-  len = nrow(vector)
-}
-
-dataColVectorLength = function(Matrix[Double] vector)  return (Double len) {
-  len = ncol(vector)
-}
-
-dataVectorGet = function(Matrix[Double] vector, Double index)  return (Double value) {
-  value = as.scalar(vector[index, 1])
-}
-
-dataVectorSet = function(Matrix[Double] vector, Double index, Double data) return (Matrix[Double] new_vector) {
-  vector[index, 1] = data
-  new_vector = vector
-}
-
-calcAvailable = function(Matrix[Double] vector) return(Double available_elements){
-  len = dataVectorLength(vector)
-  available_elements = 0.0
-  for (index in 1:len) {
-    element = dataVectorGet(vector, index)
-    if(element > 0.0) {
-      available_elements = available_elements + 1.0
-    }
-  }
-}
-
-calculateNodeDepth = function(Matrix[Double] node)  return(Double depth) {
-  depth = log(as.scalar(node), 2) + 1
-}
-
-calculateChildNodes = function(Matrix[Double] node)  return(Matrix[Double] left, Matrix[Double] right) {
-  left = node * 2.0
-  right = node * 2.0 + 1.0
-}
-
-getTypeOfCol = function(Matrix[Double] R, Double col)  return(Double type) {  # 1..scalar,  2..categorical
-  type = as.scalar(R[1, col])
-}
-
-extrapolateOrderedScalarFeatures = function(
-  Matrix[Double] X,
-  Matrix[Double] use_rows_vector,
-  Double col) return (Matrix[Double] feature_vector) {
-  feature_vector = matrix(1, rows = 1, cols = 1)
-  len = nrow(X)
-  first_time = TRUE
-  for(row in 1:len) {
-    use_feature = dataVectorGet(use_rows_vector, row)
-    if (use_feature != 0) {
-      if(first_time) {
-        feature_vector[1,1] = X[row, col]
-        first_time = FALSE
-      } else {
-        feature_vector = rbind(feature_vector, X[row, col])
-      }
-    }
+# ------------------------------------------------------------------------------
+# M              Matrix M containing the learned tree, in linearized form;
+#                for example, given a feature matrix with features [a,b,c,d]
+#                and the following tree, M would look as follows:
+#
+#                (L1)               |d<5|
+#                                  /     \
+#                (L2)           P1:2    |a<7|
+#                                       /   \
+#                (L3)                 P2:2 P3:1
+#
+#                --> M :=
+#                [[4, 5, 0, 2, 1, 7, 0, 0, 0, 0, 0, 2, 0, 1]]
+#                 |(L1)| |  (L2)   | |        (L3)         |
+# ------------------------------------------------------------------------------
+
+m_decisionTree = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] ctypes,
+    Int max_depth = 10, Int min_leaf = 20, Int min_split = 50, Double max_features = 0.5,
+    String impurity = "gini", Int seed = -1, Boolean verbose = FALSE)
+  return(Matrix[Double] M)
+{
+  t1 = time();
+
+  # initialize input data and basic statistics
+  m = nrow(X); n = ncol(X);
+  classify = (as.scalar(ctypes[1,n+1]) == 2);
+  fdom = colMaxs(X);                  # num distinct per feature
+  foffb = t(cumsum(t(fdom))) - fdom;  # feature begin
+  foffe = t(cumsum(t(fdom)))          # feature end
+  rix = matrix(seq(1,m)%*%matrix(1,1,n), m*n, 1)
+  cix = matrix(X + foffb, m*n, 1);
+  X2 = table(rix, cix, 1, m, as.scalar(foffe[,n]), FALSE); #one-hot encoded
+  y2 = table(seq(1,m), y);
+  cnt = colSums(X2);
+  I = matrix(1, rows=nrow(X), cols=1);
+
+  if( verbose ) {
+    print("decisionTree: initialize with max_depth=" + max_depth + ", 
max_features="
+      + max_features + ", impurity=" + impurity + ", seed=" + seed + ".");
+    print("decisionTree: basic statistics:");
+    print("-- impurity: " + computeImpurity(y2, I, impurity) );
+    print("-- minCount: " + min(cnt));
+    print("-- maxCount: " + max(cnt));
   }
-  feature_vector = order(target=feature_vector, by=1, decreasing=FALSE, index.return=FALSE)
-}
 
-calcPossibleThresholdsScalar = function(
-  Matrix[Double] X,
-  Matrix[Double] use_rows_vector,
-  Double col,
-  int bins) return (Matrix[Double] thresholds) {
-  ordered_features = extrapolateOrderedScalarFeatures(X, use_rows_vector, col)
-  ordered_features_len = dataVectorLength(ordered_features)
-  thresholds = matrix(1, rows = 1, cols = ordered_features_len - 1)
-  virtual_length = min(ordered_features_len, 20)
-  step_length = ordered_features_len / virtual_length
-  if (ordered_features_len > 1) {
-    for (index in 1:(virtual_length - 1)) {
-      real_index = index * step_length
-      mean = (dataVectorGet(ordered_features, real_index) + dataVectorGet(ordered_features, real_index + 1)) / 2
-      thresholds[1, index] = mean
+  # queue-based node splitting
+  M = matrix(0, rows=1, cols=2*(2^max_depth-1))
+  queue = list(list(1,I)); # node IDs / data indicators
+  maxPath = 1;
+  while( length(queue) > 0 ) {
+    # pop next node from queue for splitting
+    [queue, node0] = remove(queue, 1);
+    node = as.list(node0);
+    nID = as.scalar(node[1]);
+    nI = as.matrix(node[2]);
+    if(verbose)
+      print("decisionTree: attempting split of node "+nID+" ("+sum(nI)+" 
rows)");
+
+    # find best split attribute
+    nSeed = ifelse(seed==-1, seed, seed*nID);
+    [f, v, IDleft, Ileft, IDright, Iright] = findBestSplit(
+      X2, y2, foffb, foffe, nID, nI, min_leaf, max_features, impurity, nSeed);
+    validSplit = sum(Ileft) >= min_leaf & sum(Iright) >= min_leaf;
+    if(verbose)
+      print("-- best split: "+f+":"+v+" -> valid="+validSplit);
+    if( validSplit )
+      M[, 2*nID-1:2*nID] = t(as.matrix(list(f,v)));
+    else
+      M[, 2*nID] = computeLeafLabel(y2, nI, classify);
+    maxPath = max(maxPath, floor(log(nID,2)+1));
+
+    # split data, finalize or recurse
+    if( validSplit ) {
+      if( sum(Ileft) >= min_split & floor(log(IDleft,2))+2 < max_depth )
+        queue = append(queue, list(IDleft,Ileft));
+      else
+        M[,2*IDleft] = computeLeafLabel(y2, Ileft, classify)
+      if( sum(Iright) >= min_split & floor(log(IDright,2))+2 < max_depth )
+        queue = append(queue, list(IDright,Iright));
+      else
+        M[,2*IDright] = computeLeafLabel(y2, Iright, classify)
+      maxPath = max(maxPath, floor(log(IDleft,2)+1));
     }
   }
-}
 
-calcPossibleThresholdsCategory = function(Double type) return (Matrix[Double] thresholds) {
-  numberThresholds = 2 ^ type
-  thresholds = matrix(-1, rows = type, cols = numberThresholds)
-  toggleFactor = numberThresholds / 2
+  # summary and encoding
+  M = M[1, 1:2*(2^maxPath-1)];
 
-  for (index in 1:type) {
-    beginCols = 1
-    endCols = toggleFactor
-    iterations = numberThresholds / toggleFactor / 2
-    for (it in 1:iterations) {
-      category_val = type - index + 1
-      thresholds[index, beginCols:endCols] = matrix(category_val, rows = 1, cols = toggleFactor)
-      endCols = endCols + 2 * toggleFactor
-      beginCols = beginCols + 2 * toggleFactor
-    }
-    toggleFactor = toggleFactor / 2
-    iterations = numberThresholds / toggleFactor / 2
-  }
-  ncol = ncol(thresholds)
-  if (ncol > 2.0) {
-    thresholds = cbind(thresholds[,2:ncol-2], thresholds[,ncol-1])
+  if(verbose) {
+    print("decisionTree: final constructed tree (linearized):");
+    print("--" + toString(M));
   }
 }
 
-calcGiniImpurity = function(Double num_true, Double num_false) return (Double impurity) {
-  prop_true = num_true / (num_true + num_false)
-  prop_false = num_false / (num_true + num_false)
-  impurity = 1 - (prop_true ^ 2) - (prop_false ^ 2)
-}
-
-calcImpurity = function(
-  Matrix[Double] X,
-  Matrix[Double] Y,
-  Matrix[Double] use_rows_vector,
-  Double col,
-  Double type,
-  int bins) return (Double impurity, Matrix[Double] threshold) {
-
-  is_scalar_type = typeIsScalar(type)
-  if (is_scalar_type) {
-    possible_thresholds = calcPossibleThresholdsScalar(X, use_rows_vector, col, bins)
-  } else {
-    possible_thresholds = calcPossibleThresholdsCategory(type)
-  }
-  len_thresholds = ncol(possible_thresholds)
-  impurity = 1
-  threshold = matrix(0, rows=1, cols=1)
-  for (index in 1:len_thresholds) {
-    [false_rows, true_rows] = splitRowsVector(X, use_rows_vector, col, possible_thresholds[, index], type)
-    num_true_positive = 0; num_false_positive = 0; num_true_negative = 0; num_false_negative = 0
-    len = dataVectorLength(use_rows_vector)
-    for (c_row in 1:len) {
-      true_row_data = dataVectorGet(true_rows, c_row)
-      false_row_data = dataVectorGet(false_rows, c_row)
-      if (true_row_data != 0 & false_row_data == 0) { # IT'S POSITIVE!
-        if (as.scalar(Y[c_row, 1]) != 0) {
-          num_true_positive = num_true_positive + 1
-        } else {
-          num_false_positive = num_false_positive + 1
-        }
-      } else if (true_row_data == 0 & false_row_data != 0) { # IT'S NEGATIVE
-        if (as.scalar(Y[c_row, 1]) != 0.0) {
-          num_false_negative = num_false_negative + 1
-        } else {
-          num_true_negative = num_true_negative + 1
-        }
-      }
-    }
-    impurity_positive_branch = calcGiniImpurity(num_true_positive, num_false_positive)
-    impurity_negative_branch = calcGiniImpurity(num_true_negative, num_false_negative)
-    num_samples = num_true_positive + num_false_positive + num_true_negative + num_false_negative
-    num_negative = num_true_negative + num_false_negative
-    num_positive = num_true_positive + num_false_positive
-    c_impurity = num_positive / num_samples * impurity_positive_branch + num_negative / num_samples * impurity_negative_branch
-    if (c_impurity <= impurity) {
-      impurity = c_impurity
-      threshold = possible_thresholds[, index]
-    }
+findBestSplit = function(Matrix[Double] X2, Matrix[Double] y2, Matrix[Double] foffb, Matrix[Double] foffe,
+    Int ID, Matrix[Double] I, Int min_leaf, Double max_features, String impurity, Int seed)
+  return(Int f, Int v, Int IDleft, Matrix[Double] Ileft, Int IDright, Matrix[Double] Iright)
+{
+  # sample features iff max_features < 1
+  n = ncol(foffb);
+  numI = sum(I);
+  feat = seq(1,n);
+  if( max_features < 1.0 ) {
+    rI = rand(rows=n, cols=1, seed=seed) <= (n^max_features/n);
+    feat = removeEmpty(target=feat, margin="rows", select=rI);
   }
-}
-
-calcBestSplittingCriteria = function(
-  Matrix[Double] X,
-  Matrix[Double] Y,
-  Matrix[Double] R,
-  Matrix[Double] use_rows_vector,
-  Matrix[Double] use_cols_vector,
-  int bins)  return (Double impurity, Double used_col, Matrix[Double] threshold, Double type) {
-
-  impurity = 1
-  used_col = 1
-  threshold = matrix(0, 1, 1)
-  type = 1
-  # -- user-defined function calls not supported for iterable predicates
-  len = dataVectorLength(use_cols_vector)
-  for (c_col in 1:len) {
-    use_feature = dataVectorGet(use_cols_vector, c_col)
-    if (use_feature != 0) {
-      c_type = getTypeOfCol(R, c_col)
-      [c_impurity, c_threshold] = calcImpurity(X, Y, use_rows_vector, c_col, c_type, bins)
-      if(c_impurity <= impurity) {
-        impurity = c_impurity
-        used_col = c_col
-        threshold = c_threshold
-        type = c_type
-      }
-    }
-  }
-}
-
-typeIsScalar = function(Double type) return(Boolean b) {
-  b = type == 1.0
-}
-
-splitRowsVector = function(
-  Matrix[Double] X,
-  Matrix[Double] use_rows_vector,
-  Double col,
-  Matrix[Double] threshold,
-  Double type
-) return (Matrix[Double] false_use_rows_vector, Matrix[Double] true_use_rows_vector) {
-  type_is_scalar = typeIsScalar(type)
-  false_use_rows_vector = use_rows_vector
-  true_use_rows_vector = use_rows_vector
 
-  if (type_is_scalar) {
-    scalar_threshold = as.scalar(threshold[1,1])
-    len = dataVectorLength(use_rows_vector)
-    for (c_row in 1:len) {
-      row_enabled = dataVectorGet(use_rows_vector, c_row)
-      if (row_enabled != 0) {
-      if (as.scalar(X[c_row, col]) > scalar_threshold) {
-        false_use_rows_vector = dataVectorSet(false_use_rows_vector, c_row, 0.0)
-      } else {
-        true_use_rows_vector = dataVectorSet(true_use_rows_vector, c_row, 0.0)
-      }
-      }
+  # evaluate features and feature splits
+  # (both categorical and numerical are treated similarly by
+  # finding a cutoff point in the recoded/binned representation)
+  R = matrix(0, rows=3, cols=nrow(feat));
+  parfor( i in 1:nrow(feat) ) {
+    f = as.scalar(feat[i]);
+    beg = as.scalar(foffb[1,f])+1;
+    end = as.scalar(foffe[1,f]);
+    bestig = 0.0; bestv = -1;
+    for(j in beg:end-1 ) { # lte semantics
+       # construct predicate 0/1 vector
+       p = table(seq(beg, j), 1, ncol(X2), 1);
+       # find rows that match at least one value and appear in I
+       Ileft = ((X2 %*% p) * I) != 0;
+       Iright = I * (Ileft==0);
+       # compute information gain
+       ig = computeImpurity(y2, I, impurity)
+            - sum(Ileft)/numI * computeImpurity(y2, Ileft, impurity)
+            - sum(Iright)/numI * computeImpurity(y2, Iright, impurity);
+       # track best split value and index, incl validity
+       if( ig > bestig & sum(Ileft) >= min_leaf & sum(Iright) >= min_leaf ) {
+          bestig = ig;
+          bestv = j;
+       }
     }
-  } else {
-    len = dataVectorLength(use_rows_vector)
-    for (c_row in 1:len) {
-      row_enabled = dataVectorGet(use_rows_vector, c_row)
-      if (row_enabled != 0) {
-        categories_len = dataColVectorLength(threshold)
-        move_sample_to_true_set = FALSE
-        for (category_col_index in 1:categories_len) {
-          desired_category = as.scalar(X[c_row, col])
-          if(desired_category != -1) {
-            category_of_threshold = threshold[type - desired_category + 1, category_col_index]
-            move_sample_to_true_set = as.scalar(X[c_row, col]) == as.scalar(category_of_threshold)
-          } else {
-            #Todo: has category -1 to be considered?
-            move_sample_to_true_set = TRUE
-          }
-        }
-        if (move_sample_to_true_set) {
-          false_use_rows_vector = dataVectorSet(false_use_rows_vector, c_row, 0.0)
-        } else {
-          true_use_rows_vector = dataVectorSet(true_use_rows_vector, c_row, 0.0)
-        }
-      }
-    }
-  }
-}
-
-splitData = function(
-  Matrix[Double] X,
-  Matrix[Double] use_rows_vector,
-  Matrix[Double] use_cols_vector,
-  Double col,
-  Matrix[Double] threshold,
-  Double type
-) return (Matrix[Double] new_use_cols_vector, Matrix[Double] false_use_rows_vector, Matrix[Double] true_use_rows_vector) {
-  new_use_cols_vector = dataVectorSet(use_cols_vector, col, 0.0)
-  [false_use_rows_vector, true_use_rows_vector] = splitRowsVector(X, use_rows_vector, col, threshold, type)
-}
-
-outputMatrixBind = function(
-  Matrix[Double] M,
-  Matrix[Double] node,
-  Double offset,
-  Double used_col,
-  Matrix[Double] R,
-  Matrix[Double] threshold
-) return (Matrix[Double] new_M) {
-  col = matrix(0, rows = 5, cols = 1)
-  col[1, 1] = node[1, 1]
-  col[2, 1] = offset
-  col[3, 1] = used_col
-  if (used_col >= 1.0) { col[4, 1] = R[1, used_col] }
-  col[5, 1] = nrow(threshold)
-  col = rbind(col, threshold)
-
-  if (ncol(M) == 0 & nrow(M) == 0) {
-    new_M = col
-  } else {
-    row_difference = nrow(M) - nrow(col)
-  if (row_difference < 0.0) {
-    buffer = matrix(-1, rows = -row_difference, cols = ncol(M))
-    M = rbind(M, buffer)
-  } else if (row_difference > 0.0) {
-    buffer = matrix(-1, rows = row_difference, cols = 1)
-    col = rbind(col, buffer)
-  }
-    new_M = cbind(M, col)
+    R[,i] = as.matrix(list(f, bestig, bestv));
   }
+  ix = as.scalar(rowIndexMax(R[2,]));
+
+  # extract indicators and IDs
+  IDleft = 2 * ID;
+  IDright= 2 * ID + 1;
+  f = as.integer(as.scalar(feat[ix,1]));
+  beg = as.scalar(foffb[1,f]);
+  v = as.integer(as.scalar(R[3,ix])-beg);
+  while(FALSE){} # TODO make beg automatically known
+  p = table(seq(beg+1, beg+v), 1, ncol(X2), 1);
+  Ileft = ((X2 %*% p) * I) != 0;
+  Iright = I * (Ileft==0);
+}
+
+computeImpurity = function(Matrix[Double] y2, Matrix[Double] I, String impurity)
+  return(Double score)
+{
+  f = colSums(y2 * I) / sum(I); # rel. freq. per category/bin
+  score = 0.0;
+  if( impurity == "gini" )
+    score = 1 - sum(f^2); # sum(f*(1-f));
+  else if( impurity == "entropy" )
+    score = sum(-f * log(f));
+  else
+    stop("decisionTree: unsupported impurity measure: "+impurity);
+}
+
+computeLeafLabel = function(Matrix[Double] y2, Matrix[Double] I, Boolean classify)
+  return(Double label)
+{
+  f = colSums(y2 * I) / sum(I);
+  label = ifelse(classify,
+    as.scalar(rowIndexMax(f)), sum(t(f)*seq(1,ncol(f))));
 }
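
The vectorized internals above rely on a table()-based one-hot encoding of
the recoded/binned X. A minimal standalone sketch of that encoding (the 2x2
example values are assumptions, not from the patch):

  X = matrix("1 3 2 2", rows=2, cols=2);   # recoded input, fdom = [2, 3]
  fdom = colMaxs(X);                       # num distinct per feature
  foffb = t(cumsum(t(fdom))) - fdom;       # column offsets [0, 2]
  rix = matrix(seq(1,2) %*% matrix(1,1,2), 4, 1);
  cix = matrix(X + foffb, 4, 1);
  X2 = table(rix, cix, 1, 2, 5, FALSE);    # 2x5 one-hot matrix
  print(toString(X2));                     # [1 0 0 0 1; 0 1 0 1 0]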
diff --git a/scripts/builtin/decisionTreePredict.dml b/scripts/builtin/decisionTreePredict.dml
index d54e784e19..b312910a48 100644
--- a/scripts/builtin/decisionTreePredict.dml
+++ b/scripts/builtin/decisionTreePredict.dml
@@ -19,59 +19,51 @@
 #
 #-------------------------------------------------------------
 
-#
-# Builtin script implementing prediction based on classification trees with scale features using prediction methods of the
+# This script implements decision tree prediction for recoded and binned
+# categorical and numerical input features, using prediction methods of the
 # Hummingbird paper (https://www.usenix.org/system/files/osdi20-nakandala.pdf).
 #
 # INPUT:
-# ------------------------------------------------------------------------
-# M           Decision tree matrix M, as generated by scripts/builtin/decisionTree.dml, where each column corresponds
-#             to a node in the learned tree and each row contains the following information:
-#             M[1,j]: id of node j (in a complete binary tree)
-#             M[2,j]: Offset (no. of columns) to left child of j if j is an internal node, otherwise 0
-#             M[3,j]: Feature index of the feature (scale feature id if the feature is scale or
-#             categorical feature id if the feature is categorical)
-#             that node j looks at if j is an internal node, otherwise 0
-#             M[4,j]: Type of the feature that node j looks at if j is an internal node: holds
-#             the same information as R input vector
-#             M[5,j]: If j is an internal node: 1 if the feature chosen for j is scale,
-#             otherwise the size of the subset of values
-#             stored in rows 6,7,... if j is categorical
-#             If j is a leaf node: number of misclassified samples reaching at node j
-#             M[6:,j]: If j is an internal node: Threshold the example's feature value is compared
-#             to is stored at M[6,j] if the feature chosen for j is scale,
-#             otherwise if the feature chosen for j is categorical rows 6,7,... depict the value subset chosen for j
-#             If j is a leaf node 1 if j is impure and the number of samples at j > threshold, otherwise 0
-# X           Feature matrix X
-# strategy    Prediction strategy, can be one of ["GEMM", "TT", "PTT"], referring to "Generic matrix multiplication",
-#             "Tree traversal", and "Perfect tree traversal", respectively
-# ----------------------------------------------------------------------
+# ------------------------------------------------------------------------------
+# X               Feature matrix in recoded/binned representation
+# y               Label matrix in recoded/binned representation,
+#                 optional for accuracy evaluation
+# ctypes          Row-Vector of column types [1 scale/ordinal, 2 categorical]
+# M               Matrix M holding the learned tree in linearized form;
+#                 see decisionTree() for the detailed tree representation.
+# strategy        Prediction strategy, can be one of ["GEMM", "TT", "PTT"],
+#                 referring to "Generic matrix multiplication",
+#                 "Tree traversal", and "Perfect tree traversal", respectively
+# verbose         Flag indicating verbose debug output
+# ------------------------------------------------------------------------------
 #
 # OUTPUT:
-# ------------------------------------------------------------------
-# Y     Matrix containing the predicted labels for X 
-# ------------------------------------------------------------------
+# ------------------------------------------------------------------------------
+# yhat            Label vector of predictions
+# ------------------------------------------------------------------------------
 
-m_decisionTreePredict = function(Matrix[Double] M, Matrix[Double] X, String strategy="TT")
-  return (Matrix[Double] Y)
+m_decisionTreePredict = function(Matrix[Double] X, Matrix[Double] y = matrix(0,0,0),
+    Matrix[Double] ctypes, Matrix[Double] M, String strategy="TT", Boolean verbose = FALSE)
+  return (Matrix[Double] yhat)
 {
+    print(toString(M))
   if( strategy == "TT" )
-    Y = predict_TT(M, X);
+    yhat = predict_TT(M, X);
   else if( strategy == "GEMM" )
-    Y = predict_GEMM(M, X);
+    yhat = predict_GEMM(M, X);
   else {
     print ("No such strategy" + strategy)
-    Y = matrix("0", rows=0, cols=0)
+    yhat = matrix("0", rows=0, cols=0)
   }
 }
 
 predict_TT = function (Matrix[Double] M, Matrix[Double] X) 
-  return (Matrix[Double] Y)
+  return (Matrix[Double] yhat)
 {
   # initialization of model tensors and parameters
-  [N_L, N_R, N_F, N_T] = createTTNodeTensors(M)
-  nr = nrow(X); n = ncol(M);
-  tree_depth = ceiling(log(n+1,2)) # max depth
+  [N_L, N_R, N_F, N_T, C] = createTTNodeTensors(M)
+  nr = nrow(X); n = ncol(N_L);
+  tree_depth = ceiling(log(max(N_L)+1,2)) # max depth
 
   Ti = matrix(1, nr, 1); # current nodes (start at root)
   noChange = FALSE; i = 1;
@@ -83,14 +75,14 @@ predict_TT = function (Matrix[Double] M, Matrix[Double] X)
     TL = P %*% t(N_L); # get node left paths
     TR = P %*% t(N_R); # get node right paths
     # pick left or right path for each record separately
-    Ti_new = ifelse(Tv < Tt, TL, TR);
+    Ti_new = ifelse(Tv <= Tt, TL, TR);
     noChange = (sum(Ti != Ti_new) == 0);
     i = i + 1;
     Ti = Ti_new;
   }
 
   # extract classes
-  Y = t(table(seq(1,nr), Ti, nr, n) %*%  t(M[4,]));
+  yhat = table(seq(1,nr), Ti, nr, n) %*%  C;
 }
 
 predict_GEMM = function (Matrix[Double] M, Matrix[Double] X)
@@ -100,31 +92,38 @@ predict_GEMM = function (Matrix[Double] M, Matrix[Double] X)
   [A, B, C, D, E] = createGEMMNodeTensors(M, ncol(X));
 
   # scoring pipeline, evaluating all nodes in parallel
-  Y = t(rowIndexMax(((((X %*% A) < B) %*% C) == D) %*% E));
+  Y = rowIndexMax(((((X %*% A) < B) %*% C) == D) %*% E);
 }
 
 createTTNodeTensors = function( Matrix[Double] M )
-  return ( Matrix[Double] N_L, Matrix[Double] N_R, Matrix[Double] N_F, Matrix[Double] N_T)
+  return ( Matrix[Double] N_L, Matrix[Double] N_R, Matrix[Double] N_F, Matrix[Double] N_T, Matrix[Double] C)
 {
-  N = M[1,] # all tree nodes
-  I = M[2,] # list of node offsets to their left children
-  n_nodes  = ncol(N)
+  # all tree nodes (inner and leaf nodes)
+  M2 = matrix(M, rows=ncol(M)/2, cols=2);
+
+  NID = seq(1, nrow(M2));
+  nI = (M2[,1]!=0 | M2[,2]!=0)
+  N = t(removeEmpty(target=NID, margin="rows", select=nI));
+  n_nodes = ncol(N)
 
   # left/right child node ids, default self-id
-  P1 = table(seq(1,ncol(N)), seq(1,ncol(I))+t(I[1,]));
-  N_L = ifelse(I[1,]!=0, t(P1 %*% t(N)), t(seq(1, n_nodes)));
-  P2 = table(seq(1,ncol(N)), t(N_L+1), ncol(N), ncol(N));
-  N_R = ifelse(I[1,]!=0, t(P2 %*% t(N)), t(seq(1, n_nodes)));
+  N_L = t(removeEmpty(target=ifelse(M2[,1]!=0, 2*NID, NID), margin="rows", select=nI));
+  N_R = t(removeEmpty(target=ifelse(M2[,1]!=0, 2*NID+1, NID), margin="rows", select=nI));
 
   # node feature IDs (positions) and threshold values
-  N_F = ifelse(M[3,]!=0, M[3,], 1);
-  N_T = M[6,]; # threshold values for inner nodes, otherwise 0
+  N_F = t(removeEmpty(target=ifelse(M2[,1]!=0, M2[,1], 1), margin="rows", select=nI));
+  N_T = t(removeEmpty(target=ifelse(M2[,1]!=0, M2[,2], 0), margin="rows", select=nI));
+
+  C = removeEmpty(target=M2[,2], margin="rows", select=nI);
 }
 
 createGEMMNodeTensors = function( Matrix[Double] M, Int m )
   return (Matrix[Double] A, Matrix[Double] B, Matrix[Double] C,
   Matrix[Double] D, Matrix[Double] E)
 {
+  #TODO update for new model layout and generalize
+  stop("GEMM not fully supported yet");
+
   nin = sum(M[2,]!=0); # num inner nodes
 
   # predicate map [#feat x #inodes] and values [1 x #inodes]
@@ -135,8 +134,7 @@ createGEMMNodeTensors = function( Matrix[Double] M, Int m )
   # bucket paths [#inodes x #paths] and path sums
   I2 = (M[2,] == 0)
   np = ncol(M) - nin;
-  C = matrix("1 1 -1 -1 1 -1 0 0 0 0 1 -1",
-       rows=3, cols=4); # TODO general case
+  C = matrix("1 -1", rows=1, cols=2); # TODO general case
   D = colSums(max(C, 0));
 
   # class map [#paths x #classes]
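
For reference, predict_TT above decodes the linearized model by pairs of
columns. A minimal sketch of that layout, using the example tree from the
decisionTree.dml documentation above (node j sits in columns 2j-1 and 2j;
its children are nodes 2j and 2j+1):

  M = matrix("4 5 0 2 1 7 0 0 0 0 0 2 0 1", rows=1, cols=14);
  M2 = matrix(M, rows=ncol(M)/2, cols=2);  # row j: (feature, value/label)
  # row 1: (4,5) inner node, split on feature 4 (lte semantics)
  # row 2: (0,2) leaf, class 2; row 3: (1,7) inner node on feature 1
  # rows 6,7: (0,2),(0,1) leaves of node 3; rows 4,5: unused slots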
diff --git a/scripts/builtin/randomForest.dml b/scripts/builtin/randomForest.dml
index df130d3c80..176628e3a6 100644
--- a/scripts/builtin/randomForest.dml
+++ b/scripts/builtin/randomForest.dml
@@ -31,6 +31,7 @@
 # X               Feature matrix in recoded/binned representation
 # y               Label matrix in recoded/binned representation
 # ctypes          Row-Vector of column types [1 scale/ordinal, 2 categorical]
+#                 of shape 1-by-(ncol(X)+1), where the last entry is the y type
 # num_trees       Number of trees to be learned in the random forest model
 # sample_frac     Sample fraction of examples for each tree in the forest
 # feature_frac    Sample fraction of features for each tree in the forest
@@ -78,8 +79,8 @@ m_randomForest = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] cty
    print("randomForest: initialize with num_trees=" + num_trees + ", sample_frac=" + sample_frac
      + ", feature_frac=" + feature_frac + ", impurity=" + impurity + ", seed=" + seed + ".");
   }
-  if(ncol(ctypes) != ncol(X))
-    stop("randomForest: inconsistent num features and col types: "+ncol(X)+" 
vs "+ncol(ctypes)+".");
+  if(ncol(ctypes) != ncol(X)+1)
+    stop("randomForest: inconsistent num features (incl. label) and col types: 
"+ncol(X)+" vs "+ncol(ctypes)+".");
   if(sum(y <= 0) != 0)
     stop("randomForest: y is not properly recoded and binned (contiguous 
positive integers).");
   if(max(y) == 1)
@@ -89,7 +90,7 @@ m_randomForest = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] cty
   randSeeds = rand(rows = 3 * num_trees, cols = 1, seed=lseed, min=0, max=1e9);
 
   # training of num_tree decision trees
-  M = matrix(0, rows=num_trees, cols=2^max_depth-1);
+  M = matrix(0, rows=num_trees, cols=2*(2^max_depth-1));
   F = matrix(0, rows=num_trees, cols=ncol(X));
   parfor(i in 1:num_trees) {
     if( verbose )
@@ -115,8 +116,8 @@ m_randomForest = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] cty
     # step 3: train decision tree
     t2 = time();
     si3 = as.integer(as.scalar(randSeeds[3*(i-1)+3,1]));
-    Mtemp = decisionTree(X=Xi, Y=yi, R=ctypes, depth=max_depth);
-    # TODO add min_leaf=min_leaf, max_features = max_features, impurity=impurity, seed=si3, verbose=verbose);
+    Mtemp = decisionTree(X=Xi, y=yi, ctypes=ctypes, max_depth=max_depth,
+      min_leaf=min_leaf, max_features=max_features, impurity=impurity, seed=si3, verbose=verbose);
     M[i,1:length(Mtemp)] = matrix(Mtemp, rows=1, cols=length(Mtemp));
     if( verbose )
       print("-- ["+i+"] trained decision tree in "+(time()-t2)/1e9+" seconds.");
diff --git a/scripts/builtin/randomForestPredict.dml b/scripts/builtin/randomForestPredict.dml
index 1e08acb6ac..5923ef71f2 100644
--- a/scripts/builtin/randomForestPredict.dml
+++ b/scripts/builtin/randomForestPredict.dml
@@ -43,7 +43,7 @@ m_randomForestPredict = function(Matrix[Double] X, Matrix[Double] y = matrix(0,0
   return(Matrix[Double] yhat)
 {
   t1 = time();
-  classify = FALSE; # TODO as.scalar(ctypes[1,ncol(X)+1]) == 2;
+  classify = as.scalar(ctypes[1,ncol(X)+1]) == 2;
   yExists = (nrow(X)==nrow(y));
 
   if(verbose) {
@@ -63,7 +63,7 @@ m_randomForestPredict = function(Matrix[Double] X, Matrix[Double] y = matrix(0,0
 
     # step 2: score decision tree
     t2 = time();
-    ret = decisionTreePredict(X=Xi, M=M[i,], strategy="TT");
+    ret = decisionTreePredict(X=Xi, M=M[i,ncol(X)+1:ncol(M)], ctypes=ctypes, strategy="TT");
     Ytmp[i,1:nrow(ret)] = t(ret);
     if( verbose )
       print("-- ["+i+"] scored decision tree in "+(time()-t2)/1e9+" seconds.");
diff --git a/src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinDecisionTreePredictTest.java b/src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinDecisionTreePredictTest.java
index 187529e7d7..0cecc0e15c 100644
--- a/src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinDecisionTreePredictTest.java
+++ b/src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinDecisionTreePredictTest.java
@@ -27,6 +27,7 @@ import org.apache.sysds.runtime.matrix.data.MatrixValue;
 import org.apache.sysds.test.AutomatedTestBase;
 import org.apache.sysds.test.TestConfiguration;
 import org.apache.sysds.test.TestUtils;
+import org.junit.Ignore;
 import org.junit.Test;
 
 public class BuiltinDecisionTreePredictTest extends AutomatedTestBase {
@@ -52,11 +53,13 @@ public class BuiltinDecisionTreePredictTest extends AutomatedTestBase {
        }
        
        @Test
+       @Ignore
        public void testDecisionTreeGEMMPredictDefaultCP() {
                runDecisionTreePredict(true, ExecType.CP, "GEMM");
        }
 
        @Test
+       @Ignore
        public void testDecisionTreeGEMMPredictSP() {
                runDecisionTreePredict(true, ExecType.SPARK, "GEMM");
        }
@@ -70,16 +73,21 @@ public class BuiltinDecisionTreePredictTest extends AutomatedTestBase {
                        fullDMLScriptName = HOME + TEST_NAME + ".dml";
                        programArgs = new String[] {"-args", input("M"), input("X"), strategy, output("Y")};
 
-                       double[][] X = {{0.5, 7, 0.1}, {0.5, 7, 0.7}, {-1, -0.2, 3}, {-1, -0.2, -0.8}, {-0.3, -0.7, 3}};
-                       double[][] M = {{1, 2, 3, 4, 5, 6, 7}, {1, 2, 3, 0, 0, 0, 0}, {1, 2, 3, 0, 0, 0, 0},
-                               {1, 1, 1, 4, 5, 6, 7}, {1, 1, 1, 0, 0, 0, 0}, {0, -0.5, 0.5, 0, 0, 0, 0}};
-
+                       //data and model consistent with decision tree test
+                       double[][] X = {
+                                       {3, 1, 2, 1, 5}, 
+                                       {2, 1, 2, 2, 4}, 
+                                       {1, 1, 1, 3, 3},
+                                       {4, 2, 1, 4, 2}, 
+                                       {2, 2, 1, 5, 1},};
+                       double[][] M = {{1.0, 2.0, 0.0, 1.0, 0.0, 2.0}};
+                       
                        HashMap<MatrixValue.CellIndex, Double> expected_Y = new HashMap<>();
-                       expected_Y.put(new MatrixValue.CellIndex(1, 1), 6.0);
-                       expected_Y.put(new MatrixValue.CellIndex(1, 2), 7.0);
-                       expected_Y.put(new MatrixValue.CellIndex(1, 3), 5.0);
-                       expected_Y.put(new MatrixValue.CellIndex(1, 4), 5.0);
-                       expected_Y.put(new MatrixValue.CellIndex(1, 5), 4.0);
+                       expected_Y.put(new MatrixValue.CellIndex(1, 1), 2.0);
+                       expected_Y.put(new MatrixValue.CellIndex(2, 1), 1.0);
+                       expected_Y.put(new MatrixValue.CellIndex(3, 1), 1.0);
+                       expected_Y.put(new MatrixValue.CellIndex(4, 1), 2.0);
+                       expected_Y.put(new MatrixValue.CellIndex(5, 1), 1.0);
 
                        writeInputMatrixWithMTD("M", M, true);
                        writeInputMatrixWithMTD("X", X, true);
@@ -87,7 +95,6 @@ public class BuiltinDecisionTreePredictTest extends AutomatedTestBase {
                        runTest(true, false, null, -1);
 
                        HashMap<MatrixValue.CellIndex, Double> actual_Y = readDMLMatrixFromOutputDir("Y");
-
                        TestUtils.compareMatrices(expected_Y, actual_Y, eps, "Expected-DML", "Actual-DML");
                }
                finally {
diff --git a/src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinDecisionTreeTest.java b/src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinDecisionTreeTest.java
index 2c83275317..f8ac8397cb 100644
--- a/src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinDecisionTreeTest.java
+++ b/src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinDecisionTreeTest.java
@@ -60,30 +60,28 @@ public class BuiltinDecisionTreeTest extends AutomatedTestBase {
                        fullDMLScriptName = HOME + TEST_NAME + ".dml";
                        programArgs = new String[] {"-args", input("X"), input("Y"), input("R"), output("M")};
 
-                       double[][] Y = {{1.0}, {0.0}, {0.0}, {1.0}, {0.0}};
-
-                       double[][] X = {{4.5, 4.0, 3.0, 2.8, 3.5}, {1.9, 2.4, 1.0, 3.4, 2.9}, {2.0, 1.1, 1.0, 4.9, 3.4},
-                               {2.3, 5.0, 2.0, 1.4, 1.8}, {2.1, 1.1, 3.0, 1.0, 1.9},};
+                       double[][] Y = {{2.0}, {1.0}, {1.0}, {2.0}, {1.0}};
+                       double[][] X = {
+                               {3, 1, 2, 1, 5}, 
+                               {2, 1, 2, 2, 4}, 
+                               {1, 1, 1, 3, 3},
+                               {4, 2, 1, 4, 2}, 
+                               {2, 2, 1, 5, 1},};
+                       double[][] R = {{1.0, 1.0, 2.0, 1.0, 1.0, 1.0},};
                        writeInputMatrixWithMTD("X", X, true);
                        writeInputMatrixWithMTD("Y", Y, true);
-
-                       double[][] R = {{1.0, 1.0, 3.0, 1.0, 1.0},};
                        writeInputMatrixWithMTD("R", R, true);
 
                        runTest(true, false, null, -1);
 
                        HashMap<MatrixValue.CellIndex, Double> actual_M = readDMLMatrixFromOutputDir("M");
                        HashMap<MatrixValue.CellIndex, Double> expected_M = new HashMap<>();
-                       expected_M.put(new MatrixValue.CellIndex(1, 1), 1.0);
-                       expected_M.put(new MatrixValue.CellIndex(1, 3), 3.0);
-                       expected_M.put(new MatrixValue.CellIndex(3, 1), 2.0);
-                       expected_M.put(new MatrixValue.CellIndex(1, 2), 2.0);
-                       expected_M.put(new MatrixValue.CellIndex(2, 1), 1.0);
-                       expected_M.put(new MatrixValue.CellIndex(5, 1), 1.0);
-                       expected_M.put(new MatrixValue.CellIndex(4, 1), 1.0);
-                       expected_M.put(new MatrixValue.CellIndex(5, 3), 1.0);
-                       expected_M.put(new MatrixValue.CellIndex(5, 2), 1.0);
-                       expected_M.put(new MatrixValue.CellIndex(6, 1), 3.2);
+                       expected_M.put(new MatrixValue.CellIndex(1, 1), 1.0); //split feature 1
+                       expected_M.put(new MatrixValue.CellIndex(1, 2), 2.0); // <= 2
+                       expected_M.put(new MatrixValue.CellIndex(1, 3), 0.0); //left leaf node
+                       expected_M.put(new MatrixValue.CellIndex(1, 4), 1.0); // class 1
+                       expected_M.put(new MatrixValue.CellIndex(1, 5), 0.0); //right leaf node
+                       expected_M.put(new MatrixValue.CellIndex(1, 6), 2.0); // class 2
 
                        TestUtils.compareMatrices(expected_M, actual_M, eps, "Expected-DML", "Actual-DML");
                }
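
The expected model above can be cross-checked by hand: splitting feature 1
at <= 2 cleanly separates the two classes of the test data. A standalone
DML sketch of that check (not part of the test):

  X1 = matrix("3 2 1 4 2", rows=5, cols=1);  # first feature column
  y  = matrix("2 1 1 2 1", rows=5, cols=1);  # labels
  Ileft = X1 <= 2;
  print(toString(t(removeEmpty(target=y, margin="rows", select=Ileft))));
  # -> 1 1 1 (left leaf: class 1)
  print(toString(t(removeEmpty(target=y, margin="rows", select=1-Ileft))));
  # -> 2 2   (right leaf: class 2)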
diff --git a/src/test/scripts/functions/builtin/decisionTree.dml b/src/test/scripts/functions/builtin/decisionTree.dml
index 31bd8206e6..829451d02b 100644
--- a/src/test/scripts/functions/builtin/decisionTree.dml
+++ b/src/test/scripts/functions/builtin/decisionTree.dml
@@ -21,6 +21,9 @@
 
 X = read($1);
 Y = read($2);
-R = read($3)
-M = decisionTree(X = X, Y = Y, R = R);
+R = read($3);
+
+M = decisionTree(X = X, y = Y, ctypes = R,
+  max_features=1, min_split=4, min_leaf=2, verbose=TRUE);
+
 write(M, $4);
diff --git a/src/test/scripts/functions/builtin/decisionTreePredict.dml b/src/test/scripts/functions/builtin/decisionTreePredict.dml
index 208a8274ba..e87b01c581 100644
--- a/src/test/scripts/functions/builtin/decisionTreePredict.dml
+++ b/src/test/scripts/functions/builtin/decisionTreePredict.dml
@@ -21,5 +21,6 @@
 
 M = read($1);
 X = read($2);
-Y = decisionTreePredict(M = M, X = X, strategy = $3);
+# FIXME reordering of M and X yields wrong parameter passing
+Y = decisionTreePredict(M=M, X=X, ctypes=matrix(2,1,ncol(X)+1), strategy=$3);
 write(Y, $4);
diff --git a/src/test/scripts/functions/builtin/randomForestTest.dml b/src/test/scripts/functions/builtin/randomForestTest.dml
index 25fa54eea8..971c010662 100644
--- a/src/test/scripts/functions/builtin/randomForestTest.dml
+++ b/src/test/scripts/functions/builtin/randomForestTest.dml
@@ -36,7 +36,7 @@ jspec = "{ids: true, bin: ["
   + "{id: 7, method: equi-width, numbins: 10}]}";
 [X,D] = transformencode(target=F, spec=jspec);
 
-R = matrix(1, rows=1, cols=ncol(X));
+R = matrix(1, rows=1, cols=ncol(X)+1);
 M = randomForest(X=X, y=Y, ctypes=R, num_trees=num_trees, seed=7,
   max_depth=depth, min_leaf=num_leafs, impurity=impurity, verbose=TRUE);
 randomForestPredict(X=X, y=Y, ctypes=R, M=M, verbose=TRUE);

