This is an automated email from the ASF dual-hosted git repository.
mboehm7 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/systemds.git
The following commit(s) were added to refs/heads/main by this push:
new 8d6a1cbdba [SYSTEMDS-3505] Rework randomForest builtin function and tests
8d6a1cbdba is described below
commit 8d6a1cbdba151d8f3f5fca14d57c2e7a6044728a
Author: Matthias Boehm <[email protected]>
AuthorDate: Sun Mar 12 19:24:25 2023 +0100
[SYSTEMDS-3505] Rework randomForest builtin function and tests
The randomForest builtin function and related algorithms have been
broken for a long time. Originally, these algorithms were written to
leverage ideas similar to PLANET (with queues for large and small
scans). As part of a general overhaul of the randomForest and
decisionTree builtins, this patch reworks the randomForest function
through several improvements:
* transform encode outside randomForest/decisionTree
* simple data and feature sampling
* calls to the existing decision tree builtin
* simpler data representation of trained decision trees
* working tests and improved documentation
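
For illustration, a minimal usage sketch of the reworked builtin
(hypothetical data, file name, and transformencode spec; note that
encoding now happens outside randomForest, and ctypes marks columns
as 1=scale/ordinal, 2=categorical):

  # hypothetical DML sketch: CSV with 3 feature columns and a label column
  F = read("data.csv", data_type="frame", format="csv");
  jspec = "{ids:true, recode:[3,4], bin:[{id:1, method:equi-width, numbins:10}, {id:2, method:equi-width, numbins:10}]}";
  [X0, meta] = transformencode(target=F, spec=jspec);
  X = X0[,1:3];                             # recoded/binned features
  y = X0[,4];                               # recoded labels
  ctypes = matrix("1 1 2", rows=1, cols=3); # column types of X
  M = randomForest(X=X, y=y, ctypes=ctypes, num_trees=16,
    sample_frac=0.1, max_depth=10, seed=7, verbose=TRUE);
  print(toString(M[1,]));                   # first tree in linearized form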
---
scripts/builtin/randomForest.dml | 1333 ++------------------
scripts/builtin/slicefinder.dml | 4 +-
.../builtin/part2/BuiltinRandomForestTest.java | 22 +-
.../scripts/functions/builtin/randomForest.dml | 29 +-
4 files changed, 117 insertions(+), 1271 deletions(-)
diff --git a/scripts/builtin/randomForest.dml b/scripts/builtin/randomForest.dml
index bf26703508..df130d3c80 100644
--- a/scripts/builtin/randomForest.dml
+++ b/scripts/builtin/randomForest.dml
@@ -19,1268 +19,111 @@
#
#-------------------------------------------------------------
-# This script implement classification random forest with both scale and categorical features.
+# This script implements random forest for recoded and binned categorical and
+# numerical input features. In detail, we train multiple CART (classification
+# and regression trees) decision trees in parallel and use them as an ensemble
+# classifier/regressor. Each tree is trained on a sample of observations (rows)
+# and optionally a subset of features (columns). During tree construction, split
+# candidates are additionally chosen on a sample of remaining features.
#
# INPUT:
-# ----------------------------------------------------------------------------------------
-# X Feature matrix X; note that X needs to be both recoded and dummy coded
-# Y Label matrix Y; note that Y needs to be both recoded and dummy coded
-# R Matrix which for each feature in X contains the following information
-# - R[,1]: column ids TODO pass recorded and binned
-# - R[,2]: start indices
-# - R[,3]: end indices
-# If R is not provided by default all variables are assumed to be scale
-# bins Number of equiheight bins per scale feature to choose thresholds
-# depth Maximum depth of the learned tree
-# num_leaf Number of samples when splitting stops and a leaf node is added
-# num_samples Number of samples at which point we switch to in-memory subtree building
+# ------------------------------------------------------------------------------
+# X Feature matrix in recoded/binned representation
+# y Label matrix in recoded/binned representation
+# ctypes Row-Vector of column types [1 scale/ordinal, 2 categorical]
# num_trees Number of trees to be learned in the random forest model
-# subsamp_rate Parameter controlling the size of each tree in the forest; samples are selected from a
-# Poisson distribution with parameter subsamp_rate (the default value is 1.0)
-# feature_subset Parameter that controls the number of feature used as candidates for splitting at each tree node
-# as a power of number of features in the dataset;
-# by default square root of features (i.e., feature_subset = 0.5) are used at each tree node
-# impurity Impurity measure: entropy or Gini (the default)
-# ----------------------------------------------------------------------------------------
+# sample_frac Sample fraction of examples for each tree in the forest
+# feature_frac Sample fraction of features for each tree in the forest
+# max_depth Maximum depth of the learned tree (stopping criterion)
+# min_leaf Minimum number of samples in leaf nodes (stopping criterion)
+# max_features Parameter controlling the number of features used as split
+# candidates at tree nodes: m = ceil(num_features^max_features)
+# impurity Impurity measure: entropy, gini (default)
+# seed Fixed seed for randomization of samples and split candidates
+# verbose Flag indicating verbose debug output
+# ------------------------------------------------------------------------------
#
# OUTPUT:
-# --------------------------------------------------------------------------------------------
-# M Matrix M containing the learned tree, where each column corresponds to a node
-# in the learned tree and each row contains the following information:
-# M[1,j]: id of node j (in a complete binary tree)
-# M[2,j]: tree id to which node j belongs
-# M[3,j]: Offset (no. of columns) to left child of j
-# M[4,j]: Feature index of the feature that node j looks at if j is an internal node, otherwise 0
-# M[5,j]: Type of the feature that node j looks at if j is an internal node: 1 for scale and 2
-# for categorical features,
-# otherwise the label that leaf node j is supposed to predict
-# M[6,j]: 1 if j is an internal node and the feature chosen for j is scale, otherwise the
-# size of the subset of values
-# stored in rows 7,8,... if j is categorical
-# M[7:,j]: Only applicable for internal nodes. Threshold the example's feature value is
-# compared to is stored at M[7,j] if the feature chosen for j is scale;
-# If the feature chosen for j is categorical rows 7,8,... depict the value subset chosen for j
-# C Matrix C containing the number of times samples are chosen in each tree of the random forest
-# S_map Mappings from scale feature ids to global feature ids
-# C_map Mappings from categorical feature ids to global feature ids
-# --------------------------------------------------------------------------------------------
+# ------------------------------------------------------------------------------
+# M Matrix M containing the learned trees, in linearized form
+# For example, given a feature matrix with features [a,b,c,d]
+# and the following two trees, M would look as follows:
+#
+# (L1) |a<7| |d<5|
+# / \ / \
+# (L2) |c<3| |b<4| |a<7| P3:2
+# / \ / \ / \
+# (L3) P1:2 P2:1 P3:1 P4:2 P1:2 P2:1
+#
+# --> M :=
+# [[1, 7, 3, 3, 2, 4, 0, 2, 0, 1, 0, 1, 0, 2], (1st tree)
+# [4, 5, 1, 7, 0, 2, 0, 2, 0, 1, 0, 0, 0, 0]] (2nd tree)
+# |(L1)| | (L2) | | (L3) |
+#
+# With feature sampling (feature_frac < 1), each tree is
+# prefixed by a one-hot vector of sampled features
+# (e.g., [1,1,1,0] if we sampled a,b,c of the four features)
+# ------------------------------------------------------------------------------
-m_randomForest = function(Matrix[Double] X, Matrix[Double] Y, Matrix[Double] R,
- Integer bins = 20, Integer depth = 25, Integer num_leaf = 10, Integer num_samples = 3000,
- Integer num_trees = 1, Double subsamp_rate = 1.0, Double feature_subset = 0.5, String impurity = "Gini")
- return(Matrix[Double] M, Matrix[Double] C, Matrix[Double] S_map, Matrix[Double] C_map)
+m_randomForest = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] ctypes,
+ Int num_trees = 16, Double sample_frac = 0.1, Double feature_frac = 1.0,
+ Int max_depth = 10, Int min_leaf = 20, Double max_features = 0.5,
+ String impurity = "gini", Int seed = -1, Boolean verbose = FALSE)
+ return(Matrix[Double] M)
{
- print("started random-forest ...")
-
- num_bins = bins;
- threshold = num_samples;
- imp = impurity;
- fpow = feature_subset;
- rate = subsamp_rate;
-
- M = matrix(0, rows = 10, cols = 10);
-
- Y_bin = Y;
- num_records = nrow (X);
- num_classes = ncol (Y_bin);
-
- # check if there is only one class label
- Y_bin_sum = sum (colSums (Y_bin) == num_records);
- if (Y_bin_sum == 1)
- stop ("Y contains only one class label. No model will be learned!");
- else if (Y_bin_sum > 1)
- stop ("Y is not properly dummy coded. Multiple columns of Y contain only
ones!")
-
- # split data into X_scale and X_cat
-
- if (length(R) != 0) {
- R_tmp = order (target = R, by = 2); # sort by start indices
- dummy_coded = (R_tmp[,2] != R_tmp[,3]);
- R_scale = removeEmpty (target = R_tmp[,2:3] * (1 - dummy_coded), margin = "rows");
- R_cat = removeEmpty (target = R_tmp[,2:3] * dummy_coded, margin = "rows");
- S_map = removeEmpty (target = (1 - dummy_coded) * seq (1, nrow (R_tmp)), margin = "rows");
- C_map = removeEmpty (target = dummy_coded * seq (1, nrow (R_tmp)), margin = "rows");
-
- sum_dummy = sum (dummy_coded);
- if (sum_dummy == nrow (R_tmp)) { # all features categorical
- print ("All features categorical");
- num_cat_features = nrow (R_cat);
- num_scale_features = 0;
- X_cat = X;
- distinct_values = t (R_cat[,2] - R_cat[,1] + 1);
- distinct_values_max = max (distinct_values);
- distinct_values_offset = cumsum (t (distinct_values));
- distinct_values_overall = sum (distinct_values);
- } else if (sum_dummy == 0) { # all features scale
- print ("All features scale");
- num_scale_features = ncol (X);
- num_cat_features = 0;
- X_scale = X;
- distinct_values_max = 1;
- } else { # some features scale some features categorical
- num_cat_features = nrow (R_cat);
- num_scale_features = nrow (R_scale);
- distinct_values = t (R_cat[,2] - R_cat[,1] + 1);
- distinct_values_max = max (distinct_values);
- distinct_values_offset = cumsum (t (distinct_values));
- distinct_values_overall = sum (distinct_values);
-
- W = matrix (1, rows = num_cat_features, cols = 1) %*% matrix ("1 -1", rows = 1, cols = 2);
- W = matrix (W, rows = 2 * num_cat_features, cols = 1);
- if (as.scalar (R_cat[num_cat_features, 2]) == ncol (X))
- W[2 * num_cat_features,] = 0;
-
- last = (R_cat[,2] != ncol (X));
- R_cat1 = (R_cat[,2] + 1) * last;
- R_cat[,2] = (R_cat[,2] * (1 - last)) + R_cat1;
- R_cat_vec = matrix (R_cat, rows = 2 * num_cat_features, cols = 1);
-
- col_tab = table (R_cat_vec, 1, W, ncol (X), 1);
- col_ind = cumsum (col_tab);
-
- col_ind_cat = removeEmpty (target = col_ind * seq (1, ncol (X)), margin = "rows");
- col_ind_scale = removeEmpty (target = (1 - col_ind) * seq (1, ncol (X)), margin = "rows");
- X_cat = X %*% table (col_ind_cat, seq (1, nrow (col_ind_cat)), ncol (X), nrow (col_ind_cat));
- X_scale = X %*% table (col_ind_scale, seq (1, nrow (col_ind_scale)), ncol (X), nrow (col_ind_scale));
- }
- } else { # only scale features exist
- print ("All features scale");
- num_scale_features = ncol (X);
- num_cat_features = 0;
- X_scale = X;
- distinct_values_max = 1;
- }
-
- if (num_scale_features > 0) {
- print ("COMPUTING BINNING...");
- bin_size = max (as.integer (num_records / num_bins), 1);
- count_thresholds = matrix (0, rows = 1, cols = num_scale_features)
- thresholds = matrix (0, rows = num_bins+1, cols = num_scale_features)
- parfor(i1 in 1:num_scale_features) {
- #approximate equi-height binning
- bbin = seq(0, 1, 1/num_bins);
- col_bins = quantile(X_scale[,i1], bbin)
- count_thresholds[,i1] = num_bins; #TODO probe empty
- thresholds[,i1] = col_bins;
- }
+ t1 = time();
- print ("PREPROCESSING SCALE FEATURE MATRIX...");
- min_num_bins = min (count_thresholds);
- max_num_bins = max (count_thresholds);
- total_num_bins = sum (count_thresholds);
- cum_count_thresholds = t (cumsum (t (count_thresholds)));
- X_scale_ext = matrix (0, rows = num_records, cols = total_num_bins);
- parfor (i2 in 1:num_scale_features, check = 0) {
- Xi2 = X_scale[,i2];
- count_threshold = as.scalar (count_thresholds[,i2]);
- offset_feature = 1;
- if (i2 > 1)
- offset_feature = offset_feature + as.integer (as.scalar (cum_count_thresholds[, (i2 - 1)]));
- ti2 = t(thresholds[1:count_threshold, i2]);
- X_scale_ext[,offset_feature:(offset_feature + count_threshold - 1)] = outer (Xi2, ti2, "<");
- }
+ # validation and initialization of reproducible seeds
+ if(verbose) {
+ print("randomForest: initialize with num_trees=" + num_trees + ",
sample_frac=" + sample_frac
+ + ", feature_frac=" + feature_frac + ", impurity=" + impurity + ",
seed=" + seed + ".");
}
+ if(ncol(ctypes) != ncol(X))
+ stop("randomForest: inconsistent num features and col types: "+ncol(X)+" vs "+ncol(ctypes)+".");
+ if(sum(y <= 0) != 0)
+ stop("randomForest: y is not properly recoded and binned (contiguous positive integers).");
+ if(max(y) == 1)
+ stop("randomForest: y contains only one class label.");
- num_features_total = num_scale_features + num_cat_features;
- num_feature_samples = as.integer (floor (num_features_total ^ fpow));
-
-
#-------------------------------------------------------------------------------------
-
- ##### INITIALIZATION
- L = matrix (1, rows = num_records, cols = num_trees); # last visited node id for each training sample
-
- # create matrix of counts (generated by Poisson distribution) storing how many times each sample appears in each tree
- print ("CONPUTING COUNTS...");
- C = rand (rows = num_records, cols = num_trees, pdf = "poisson", lambda = rate);
- Ix_nonzero = (C != 0);
- L = L * Ix_nonzero;
- total_counts = sum (C);
+ lseed = as.integer(ifelse(seed!=-1, seed, as.scalar(rand(rows=1,cols=1,min=0, max=1e9))));
+ randSeeds = rand(rows = 3 * num_trees, cols = 1, seed=lseed, min=0, max=1e9);
- # model
- # LARGE leaf nodes
- # NC_large[,1]: node id
- # NC_large[,2]: tree id
- # NC_large[,3]: class label
- # NC_large[,4]: no. of misclassified samples
- # NC_large[,5]: 1 if special leaf (impure and 3 samples at that leaf > threshold) or 0 otherwise
- NC_large = matrix (0, rows = 5, cols = 1);
-
- # SMALL leaf nodes
- # same schema as for LARGE leaf nodes (to be initialized)
- NC_small = matrix (0, rows = 5, cols = 1);
-
- # LARGE internal nodes
- # Q_large[,1]: node id
- # Q_large[,2]: tree id
- Q_large = matrix (0, rows = 2, cols = num_trees);
- Q_large[1,] = matrix (1, rows = 1, cols = num_trees);
- Q_large[2,] = t (seq (1, num_trees));
-
- # SMALL internal nodes
- # same schema as for LARGE internal nodes (to be initialized)
- Q_small = matrix (0, rows = 2, cols = 1);
-
- # F_large[,1]: feature
- # F_large[,2]: type
- # F_large[,3]: offset
- F_large = matrix (0, rows = 3, cols = 1);
-
- # same schema as for LARGE nodes
- F_small = matrix (0, rows = 3, cols = 1);
-
- # split points for LARGE internal nodes
- S_large = matrix (0, rows = 1, cols = 1);
-
- # split points for SMALL internal nodes
- S_small = matrix (0, rows = 1, cols = 1);
-
- # initialize queue
- cur_nodes_large = matrix (1, rows = 2, cols = num_trees);
- cur_nodes_large[2,] = t (seq (1, num_trees));
-
- num_cur_nodes_large = num_trees;
- num_cur_nodes_small = 0;
- level = 0;
-
- while ((num_cur_nodes_large + num_cur_nodes_small) > 0 & level < depth) {
- level = level + 1;
- print (" --- start level " + level + " --- ");
-
- ##### PREPARE MODEL
- if (num_cur_nodes_large > 0) { # LARGE nodes to process
- cur_Q_large = matrix (0, rows = 2, cols = 2 * num_cur_nodes_large);
- cur_NC_large = matrix (0, rows = 5, cols = 2 * num_cur_nodes_large);
- cur_F_large = matrix (0, rows = 3, cols = num_cur_nodes_large);
- cur_S_large = matrix (0, rows = 1, cols = num_cur_nodes_large * distinct_values_max);
- cur_nodes_small = matrix (0, rows = 3, cols = 2 * num_cur_nodes_large);
- }
-
- ##### LOOP OVER LARGE NODES...
- parfor (i6 in 1:num_cur_nodes_large, check = 0) {
- cur_node = as.scalar (cur_nodes_large[1,i6]);
- cur_tree = as.scalar (cur_nodes_large[2,i6]);
-
- # select sample features WOR
- feature_samples = sample (num_features_total, num_feature_samples);
- feature_samples = order (target = feature_samples, by = 1);
- num_scale_feature_samples = sum (feature_samples <= num_scale_features);
- num_cat_feature_samples = num_feature_samples - num_scale_feature_samples;
-
- # --- find best split ---
- # samples that reach cur_node
- Ix = (L[,cur_tree] == cur_node);
-
- cur_Y_bin = Y_bin * (Ix * C[,cur_tree]);
- label_counts_overall = colSums (cur_Y_bin);
- label_sum_overall = sum (label_counts_overall);
- label_dist_overall = label_counts_overall / label_sum_overall;
-
- if (imp == "entropy") {
- label_dist_zero = (label_dist_overall == 0);
- cur_impurity = - sum (label_dist_overall * log (label_dist_overall +
label_dist_zero)); # / log (2); # impurity before
- } else { # imp == "Gini"
- cur_impurity = sum (label_dist_overall * (1 - label_dist_overall)); #
impurity before
- }
- best_scale_gain = 0;
- best_cat_gain = 0;
- if (num_scale_features > 0 & num_scale_feature_samples > 0) {
- scale_feature_samples = feature_samples[1:num_scale_feature_samples,];
-
- # main operation
- label_counts_left_scale = t (t (cur_Y_bin) %*% X_scale_ext);
-
- # compute left and right label distribution
- label_sum_left = rowSums (label_counts_left_scale);
- label_dist_left = label_counts_left_scale / label_sum_left;
- if (imp == "entropy") {
- label_dist_left = replace (target = label_dist_left, pattern = 0,
replacement = 1);
- log_label_dist_left = log (label_dist_left); # / log (2)
- impurity_left_scale = - rowSums (label_dist_left *
log_label_dist_left);
- } else { # imp == "Gini"
- impurity_left_scale = rowSums (label_dist_left * (1 -
label_dist_left));
- }
- #
- label_counts_right_scale = - label_counts_left_scale +
label_counts_overall;
- label_sum_right = rowSums (label_counts_right_scale);
- label_dist_right = label_counts_right_scale / label_sum_right;
- if (imp == "entropy") {
- label_dist_right = replace (target = label_dist_right, pattern = 0,
replacement = 1);
- log_label_dist_right = log (label_dist_right); # / log (2)
- impurity_right_scale = - rowSums (label_dist_right *
log_label_dist_right);
- } else { # imp == "Gini"
- impurity_right_scale = rowSums (label_dist_right * (1 -
label_dist_right));
- }
-
- I_gain_scale = cur_impurity - ( ( label_sum_left / label_sum_overall ) * impurity_left_scale + ( label_sum_right / label_sum_overall ) * impurity_right_scale);
-
- I_gain_scale = replace (target = I_gain_scale, pattern = NaN, replacement = 0);
-
- # determine best feature to split on and the split value
- feature_start_ind = matrix (0, rows = 1, cols = num_scale_features);
- feature_start_ind[1,1] = 1;
- if (num_scale_features > 1) {
- feature_start_ind[1,2:num_scale_features] =
cum_count_thresholds[1,1:(num_scale_features - 1)] + 1;
- }
- max_I_gain_found = 0;
- max_I_gain_found_ind = 0;
- best_i = 0;
-
- for (i in 1:num_scale_feature_samples) { # assuming feature_samples is
5x1
- cur_feature_samples_bin = as.scalar (scale_feature_samples[i,]);
- cur_start_ind = as.scalar
(feature_start_ind[,cur_feature_samples_bin]);
- cur_end_ind = as.scalar
(cum_count_thresholds[,cur_feature_samples_bin]);
- I_gain_portion = I_gain_scale[cur_start_ind:cur_end_ind,];
- cur_max_I_gain = max (I_gain_portion);
- cur_max_I_gain_ind = as.scalar (rowIndexMax (t (I_gain_portion)));
- if (cur_max_I_gain > max_I_gain_found) {
- max_I_gain_found = cur_max_I_gain;
- max_I_gain_found_ind = cur_max_I_gain_ind;
- best_i = i;
- }
- }
-
- best_scale_gain = max_I_gain_found;
- max_I_gain_ind_scale = max_I_gain_found_ind;
- best_scale_feature = 0;
- if (best_i > 0) {
- best_scale_feature = as.scalar (scale_feature_samples[best_i,]);
- }
- best_scale_split = max_I_gain_ind_scale;
- if (best_scale_feature > 1) {
- best_scale_split = best_scale_split +
as.scalar(cum_count_thresholds[,(best_scale_feature - 1)]);
- }
- }
-
- if (num_cat_features > 0 & num_cat_feature_samples > 0){
- cat_feature_samples = feature_samples[(num_scale_feature_samples +
1):(num_scale_feature_samples + num_cat_feature_samples),] - num_scale_features;
-
- # initialization
- split_values_bin = matrix (0, rows = 1, cols =
distinct_values_overall);
- split_values = split_values_bin;
- split_values_offset = matrix (0, rows = 1, cols = num_cat_features);
- I_gains = split_values_offset;
- impurities_left = split_values_offset;
- impurities_right = split_values_offset;
- best_label_counts_left = matrix (0, rows = num_cat_features, cols =
num_classes);
- best_label_counts_right = matrix (0, rows = num_cat_features, cols =
num_classes);
-
- # main operation
- label_counts = t (t (cur_Y_bin) %*% X_cat);
-
- parfor (i9 in 1:num_cat_feature_samples, check = 0){
- cur_cat_feature = as.scalar (cat_feature_samples[i9,1]);
- start_ind = 1;
- if (cur_cat_feature > 1)
- start_ind = start_ind + as.scalar
(distinct_values_offset[(cur_cat_feature - 1),]);
- offset = as.scalar (distinct_values[1,cur_cat_feature]);
- cur_label_counts = label_counts[start_ind:(start_ind + offset - 1),];
- label_sum = rowSums (cur_label_counts);
- label_dist = cur_label_counts / label_sum;
- if (imp == "entropy") {
- label_dist = replace (target = label_dist, pattern = 0,
replacement = 1);
- log_label_dist = log (label_dist); # / log(2)
- impurity_tmp = - rowSums (label_dist * log_label_dist);
- impurity_tmp = replace (target = impurity_tmp, pattern = NaN,
replacement = 1/0);
- } else { # imp == "Gini"
- impurity_tmp = rowSums (label_dist * (1 - label_dist));
- }
-
- # sort cur feature by impurity
- cur_distinct_values = seq (1, nrow (cur_label_counts));
- cur_distinct_values_impurity = cbind (cur_distinct_values,
impurity_tmp);
- cur_feature_sorted = order (target = cur_distinct_values_impurity,
by = 2, decreasing = FALSE);
- P = table (cur_distinct_values, cur_feature_sorted); # permutation
matrix
- label_counts_sorted = P %*% cur_label_counts;
-
- # compute left and right label distribution
- label_counts_left = cumsum (label_counts_sorted);
-
- label_sum_left = rowSums (label_counts_left);
- label_dist_left = label_counts_left / label_sum_left;
- label_dist_left = replace (target = label_dist_left, pattern = NaN,
replacement = 1);
- if (imp == "entropy") {
- label_dist_left = replace (target = label_dist_left, pattern = 0,
replacement = 1);
- log_label_dist_left = log (label_dist_left); # / log(2)
- impurity_left = - rowSums (label_dist_left * log_label_dist_left);
- } else { # imp == "Gini"
- impurity_left = rowSums (label_dist_left * (1 - label_dist_left));
- }
- #
- label_counts_right = - label_counts_left + label_counts_overall;
- label_sum_right = rowSums (label_counts_right);
- label_dist_right = label_counts_right / label_sum_right;
- label_dist_right = replace (target = label_dist_right, pattern =
NaN, replacement = 1);
- if (imp == "entropy") {
- label_dist_right = replace (target = label_dist_right, pattern =
0, replacement = 1);
- log_label_dist_right = log (label_dist_right); # / log (2)
- impurity_right = - rowSums (label_dist_right *
log_label_dist_right);
- } else { # imp == "Gini"
- impurity_right = rowSums (label_dist_right * (1 -
label_dist_right));
- }
- I_gain = cur_impurity - ( ( label_sum_left / label_sum_overall ) * impurity_left + ( label_sum_right / label_sum_overall ) * impurity_right);
-
- Ix_label_sum_left_zero = (label_sum_left == 0);
- Ix_label_sum_right_zero = (label_sum_right == 0);
- Ix_label_sum_zero = Ix_label_sum_left_zero * Ix_label_sum_right_zero;
- I_gain = I_gain * (1 - Ix_label_sum_zero);
-
- I_gain[nrow (I_gain),] = 0; # last entry invalid
-
- max_I_gain_ind = as.scalar (rowIndexMax (t (I_gain)));
-
- split_values[1, start_ind:(start_ind + max_I_gain_ind - 1)] = t
(cur_feature_sorted[1:max_I_gain_ind,1]);
- for (i10 in 1:max_I_gain_ind) {
- ind = as.scalar (cur_feature_sorted[i10,1]);
- if (ind == 1)
- split_values_bin[1,start_ind] = 1.0;
- else
- split_values_bin[1,(start_ind + ind - 1)] = 1.0;
- }
- split_values_offset[1,cur_cat_feature] = max_I_gain_ind;
-
- I_gains[1,cur_cat_feature] = max (I_gain);
-
- impurities_left[1,cur_cat_feature] = as.scalar
(impurity_left[max_I_gain_ind,]);
- impurities_right[1,cur_cat_feature] = as.scalar
(impurity_right[max_I_gain_ind,]);
- best_label_counts_left[cur_cat_feature,] =
label_counts_left[max_I_gain_ind,];
- best_label_counts_right[cur_cat_feature,] =
label_counts_right[max_I_gain_ind,];
- }
-
- # determine best feature to split on and the split values
- best_cat_feature = as.scalar (rowIndexMax (I_gains));
- best_cat_gain = max (I_gains);
- start_ind = 1;
- if (best_cat_feature > 1)
- start_ind = start_ind + as.scalar
(distinct_values_offset[(best_cat_feature - 1),]);
- offset = as.scalar (distinct_values[1,best_cat_feature]);
- best_split_values_bin = split_values_bin[1, start_ind:(start_ind +
offset - 1)];
- }
-
- # compare best scale feature to best cat. feature and pick the best one
- if (num_scale_features > 0 & num_scale_feature_samples > 0 &
best_scale_gain >= best_cat_gain & best_scale_gain > 0) {
- # --- update model ---
- cur_F_large[1,i6] = best_scale_feature;
- cur_F_large[2,i6] = 1;
- cur_F_large[3,i6] = 1;
- cur_S_large[1,(i6 - 1) * distinct_values_max + 1] =
thresholds[max_I_gain_ind_scale, best_scale_feature];
- left_child = 2 * (cur_node - 1) + 1 + 1;
- right_child = 2 * (cur_node - 1) + 2 + 1;
-
- # samples going to the left subtree
- Ix_left = X_scale_ext[,best_scale_split];
- Ix_left = Ix * Ix_left;
- Ix_right = Ix * (1 - Ix_left);
- L[,cur_tree] = L[,cur_tree] * (1 - Ix_left) + (Ix_left * left_child);
- L[,cur_tree] = L[,cur_tree] * (1 - Ix_right) + (Ix_right *
right_child);
- left_child_size = sum (Ix_left * C[,cur_tree]);
- right_child_size = sum (Ix_right * C[,cur_tree]);
-
- # check if left or right child is a leaf
- left_pure = FALSE;
- right_pure = FALSE;
- cur_impurity_left = as.scalar(impurity_left_scale[best_scale_split,]);
# max_I_gain_ind_scale
- cur_impurity_right =
as.scalar(impurity_right_scale[best_scale_split,]); # max_I_gain_ind_scale
- if ( (left_child_size <= num_leaf | cur_impurity_left == 0 | (level ==
depth)) &
- (right_child_size <= num_leaf | cur_impurity_right == 0 | (level ==
depth)) |
- (left_child_size <= threshold & right_child_size <= threshold &
(level == depth)) ) { # both left and right nodes are leaf
-
- cur_label_counts_left = label_counts_left_scale[best_scale_split,];
# max_I_gain_ind_scale
- cur_NC_large[1,(2 * (i6 - 1) + 1)] = left_child;
- cur_NC_large[2,(2 * (i6 - 1) + 1)] = cur_tree;
- cur_NC_large[3,(2 * (i6 - 1) + 1)] = as.scalar( rowIndexMax
(cur_label_counts_left)); # leaf class label
- left_pure = TRUE;
- # compute number of misclassified points
- cur_NC_large[4,(2 * (i6 - 1) + 1)] = left_child_size - max
(cur_label_counts_left);
-
- cur_label_counts_right = label_counts_overall -
cur_label_counts_left;
- cur_NC_large[1,(2 * i6)] = right_child;
- cur_NC_large[2,(2 * i6)] = cur_tree;
- cur_NC_large[3,(2 * i6)] = as.scalar( rowIndexMax
(cur_label_counts_right)); # leaf class label
- right_pure = TRUE;
- # compute number of misclassified pints
- cur_NC_large[4,(2 * i6)] = right_child_size - max
(cur_label_counts_right);
-
- } else if (left_child_size <= num_leaf | cur_impurity_left == 0 |
(level == depth) |
- (left_child_size <= threshold & (level == depth))) {
-
- cur_label_counts_left = label_counts_left_scale[best_scale_split,];
# max_I_gain_ind_scale
- cur_NC_large[1,(2 * (i6 - 1) + 1)] = left_child;
- cur_NC_large[2,(2 * (i6 - 1) + 1)] = cur_tree;
- cur_NC_large[3,(2 * (i6 - 1) + 1)] = as.scalar( rowIndexMax
(cur_label_counts_left)); # leaf class label
- left_pure = TRUE;
- # compute number of misclassified points
- cur_NC_large[4,(2 * (i6 - 1) + 1)] = left_child_size - max
(cur_label_counts_left);
-
- } else if (right_child_size <= num_leaf | cur_impurity_right == 0 |
(level == depth) |
- (right_child_size <= threshold & (level == depth))) {
-
- cur_label_counts_right =
label_counts_right_scale[best_scale_split,]; # max_I_gain_ind_scale
- cur_NC_large[1,(2 * i6)] = right_child;
- cur_NC_large[2,(2 * i6)] = cur_tree;
- cur_NC_large[3,(2 * i6)] = as.scalar( rowIndexMax
(cur_label_counts_right)); # leaf class label
- right_pure = TRUE;
- # compute number of misclassified pints
- cur_NC_large[4,(2 * i6)] = right_child_size - max
(cur_label_counts_right);
- }
- }
- else if (num_cat_features > 0 & num_cat_feature_samples > 0 &
best_cat_gain > 0) {
- # --- update model ---
- cur_F_large[1,i6] = best_cat_feature;
- cur_F_large[2,i6] = 2;
- offset_nonzero = as.scalar (split_values_offset[1,best_cat_feature]);
- S_start_ind = (i6 - 1) * distinct_values_max + 1;
- cur_F_large[3,i6] = offset_nonzero;
- cur_S_large[1,S_start_ind:(S_start_ind + offset_nonzero - 1)] =
split_values[1,start_ind:(start_ind + offset_nonzero - 1)];
-
- left_child = 2 * (cur_node - 1) + 1 + 1;
- right_child = 2 * (cur_node - 1) + 2 + 1;
-
- # samples going to the left subtree
- Ix_left = rowSums (X_cat[,start_ind:(start_ind + offset - 1)] *
best_split_values_bin);
- Ix_left = (Ix_left >= 1);
- Ix_left = Ix * Ix_left;
- Ix_right = Ix * (1 - Ix_left);
- L[,cur_tree] = L[,cur_tree] * (1 - Ix_left) + (Ix_left * left_child);
- L[,cur_tree] = L[,cur_tree] * (1 - Ix_right) + (Ix_right *
right_child);
- left_child_size = sum (Ix_left * C[,cur_tree]);
- right_child_size = sum (Ix_right * C[,cur_tree]);
-
- # check if left or right child is a leaf
- left_pure = FALSE;
- right_pure = FALSE;
- cur_impurity_left = as.scalar(impurities_left[,best_cat_feature]);
- cur_impurity_right = as.scalar(impurities_right[,best_cat_feature]);
- if ( (left_child_size <= num_leaf | cur_impurity_left == 0 | (level ==
depth)) &
- (right_child_size <= num_leaf | cur_impurity_right == 0 | (level ==
depth)) |
- (left_child_size <= threshold & right_child_size <= threshold &
(level == depth)) ) { # both left and right nodes are leaf
-
- cur_label_counts_left = best_label_counts_left[best_cat_feature,];
- cur_NC_large[1,(2 * (i6 - 1) + 1)] = left_child;
- cur_NC_large[2,(2 * (i6 - 1) + 1)] = cur_tree;
- cur_NC_large[3,(2 * (i6 - 1) + 1)] = as.scalar( rowIndexMax
(cur_label_counts_left)); # leaf class label
- left_pure = TRUE;
- # compute number of misclassified points
- cur_NC_large[4,(2 * (i6 - 1) + 1)] = left_child_size - max
(cur_label_counts_left);
-
- cur_label_counts_right = label_counts_overall -
cur_label_counts_left;
- cur_NC_large[1,(2 * i6)] = right_child;
- cur_NC_large[2,(2 * i6)] = cur_tree;
- cur_NC_large[3,(2 * i6)] = as.scalar( rowIndexMax
(cur_label_counts_right)); # leaf class label
- right_pure = TRUE;
- # compute number of misclassified pints
- cur_NC_large[4,(2 * i6)] = right_child_size - max
(cur_label_counts_right);
-
- } else if (left_child_size <= num_leaf | cur_impurity_left == 0 |
(level == depth) |
- (left_child_size <= threshold & (level == depth))) {
-
- cur_label_counts_left = best_label_counts_left[best_cat_feature,];
- cur_NC_large[1,(2 * (i6 - 1) + 1)] = left_child;
- cur_NC_large[2,(2 * (i6 - 1) + 1)] = cur_tree;
- cur_NC_large[3,(2 * (i6 - 1) + 1)] = as.scalar( rowIndexMax
(cur_label_counts_left)); # leaf class label
- left_pure = TRUE;
- # compute number of misclassified points
- cur_NC_large[4,(2 * (i6 - 1) + 1)] = left_child_size - max
(cur_label_counts_left);
-
- } else if (right_child_size <= num_leaf | cur_impurity_right == 0 |
(level == depth) |
- (right_child_size <= threshold & (level == depth))) {
-
- cur_label_counts_right = best_label_counts_right[best_cat_feature,];
- cur_NC_large[1,(2 * i6)] = right_child;
- cur_NC_large[2,(2 * i6)] = cur_tree;
- cur_NC_large[3,(2 * i6)] = as.scalar( rowIndexMax
(cur_label_counts_right)); # leaf class label
- right_pure = TRUE;
- # compute number of misclassified pints
- cur_NC_large[4,(2 * i6)] = right_child_size - max
(cur_label_counts_right);
-
- }
- } else {
- print ("NUMBER OF SAMPLES AT NODE " + cur_node + " in tree " +
cur_tree + " CANNOT BE REDUCED TO MATCH " + num_leaf + ". THIS NODE IS DECLARED
AS LEAF!");
- right_pure = TRUE;
- left_pure = TRUE;
- cur_NC_large[1,(2 * (i6 - 1) + 1)] = cur_node;
- cur_NC_large[2,(2 * (i6 - 1) + 1)] = cur_tree;
- class_label = as.scalar (rowIndexMax (label_counts_overall));
- cur_NC_large[3,(2 * (i6 - 1) + 1)] = class_label;
- cur_NC_large[4,(2 * (i6 - 1) + 1)] = label_sum_overall - max
(label_counts_overall);
- cur_NC_large[5,(2 * (i6 - 1) + 1)] = 1; # special leaf
- }
-
- # add nodes to Q
- if (!left_pure) {
- if (left_child_size > threshold) {
- cur_Q_large[1,(2 * (i6 - 1)+ 1)] = left_child;
- cur_Q_large[2,(2 * (i6 - 1)+ 1)] = cur_tree;
- } else {
- cur_nodes_small[1,(2 * (i6 - 1)+ 1)] = left_child;
- cur_nodes_small[2,(2 * (i6 - 1)+ 1)] = left_child_size;
- cur_nodes_small[3,(2 * (i6 - 1)+ 1)] = cur_tree;
- }
- }
- if (!right_pure) {
- if (right_child_size > threshold) {
- cur_Q_large[1,(2 * i6)] = right_child;
- cur_Q_large[2,(2 * i6)] = cur_tree;
- } else{
- cur_nodes_small[1,(2 * i6)] = right_child;
- cur_nodes_small[2,(2 * i6)] = right_child_size;
- cur_nodes_small[3,(2 * i6)] = cur_tree;
- }
- }
- }
+ # training of num_trees decision trees
+ M = matrix(0, rows=num_trees, cols=2^max_depth-1);
+ F = matrix(0, rows=num_trees, cols=ncol(X));
+ parfor(i in 1:num_trees) {
+ if( verbose )
+ print("randomForest: start training tree "+i+"/"+num_trees+".");
- ##### PREPARE MODEL FOR LARGE NODES
- if (num_cur_nodes_large > 0) {
- cur_Q_large = removeEmpty (target = cur_Q_large, margin = "cols");
- if (as.scalar (cur_Q_large[1,1]) != 0) Q_large = cbind (Q_large,
cur_Q_large);
- cur_NC_large = removeEmpty (target = cur_NC_large, margin = "cols");
- if (as.scalar (cur_NC_large[1,1]) != 0) NC_large = cbind (NC_large,
cur_NC_large);
-
- cur_F_large = removeEmpty (target = cur_F_large, margin = "cols");
- if (as.scalar (cur_F_large[1,1]) != 0) F_large = cbind (F_large,
cur_F_large);
- cur_S_large = removeEmpty (target = cur_S_large, margin = "cols");
- if (as.scalar (cur_S_large[1,1]) != 0) S_large = cbind (S_large,
cur_S_large);
-
- num_cur_nodes_large_pre = 2 * num_cur_nodes_large;
- if (as.scalar (cur_Q_large[1,1]) == 0) {
- num_cur_nodes_large = 0;
- } else {
- cur_nodes_large = cur_Q_large;
- num_cur_nodes_large = ncol (cur_Q_large);
- }
- }
-
- ##### PREPARE MODEL FOR SMALL NODES
- cur_nodes_small_nonzero = removeEmpty (target = cur_nodes_small, margin =
"cols");
- if (as.scalar (cur_nodes_small_nonzero[1,1]) != 0) { # if SMALL nodes exist
- num_cur_nodes_small = ncol (cur_nodes_small_nonzero);
- }
+ # step 1: sample data
+ si1 = as.integer(as.scalar(randSeeds[3*(i-1)+1,1]));
+ I1 = rand(rows=nrow(X), cols=1, seed=si1) <= sample_frac;
+ Xi = removeEmpty(target=X, margin="rows", select=I1);
+ yi = removeEmpty(target=y, margin="rows", select=I1);
- if (num_cur_nodes_small > 0) { # SMALL nodes to process
- reserve_len = sum (2 ^ (ceil (log (cur_nodes_small_nonzero[2,]) / log (2)))) + num_cur_nodes_small;
- cur_Q_small = matrix (0, rows = 2, cols = reserve_len);
- cur_F_small = matrix (0, rows = 3, cols = reserve_len);
- cur_NC_small = matrix (0, rows = 5, cols = reserve_len);
- cur_S_small = matrix (0, rows = 1, cols = reserve_len * distinct_values_max); # split values of the best feature
+ # step 2: sample features
+ if( feature_frac < 1.0 ) {
+ si2 = as.integer(as.scalar(randSeeds[3*(i-1)+2,1]));
+ I2 = rand(rows=ncol(X), cols=1, seed=si2) <= feature_frac;
+ Xi = removeEmpty(target=Xi, margin="cols", select=I2);
+ F[i,] = t(I2);
}
-
- ##### LOOP OVER SMALL NODES...
- for (i7 in 1:num_cur_nodes_small) {
- cur_node_small = as.scalar (cur_nodes_small_nonzero[1,i7]);
- cur_tree_small = as.scalar (cur_nodes_small_nonzero[3,i7]);
- # build dataset for SMALL node
- Ix = (L[,cur_tree_small] == cur_node_small);
- #print(as.scalar(Ix[0,0]));
- if (num_scale_features > 0)
- X_scale_ext_small = removeEmpty (target = X_scale_ext, margin =
"rows", select = Ix);
- if (num_cat_features > 0)
- X_cat_small = removeEmpty (target = X_cat, margin = "rows", select =
Ix);
- L_small = removeEmpty (target = L * Ix, margin = "rows");
- C_small = removeEmpty (target = C * Ix, margin = "rows");
- Y_bin_small = removeEmpty (target = Y_bin * Ix, margin = "rows");
-
- # compute offset
- offsets = cumsum (t (2 ^ ceil (log (cur_nodes_small_nonzero[2,]) / log
(2))));
- start_ind_global = 1;
- if (i7 > 1)
- start_ind_global = start_ind_global + as.scalar (offsets[(i7 - 1),]);
- start_ind_S_global = 1;
- if (i7 > 1)
- start_ind_S_global = start_ind_S_global + (as.scalar (offsets[(i7 -
1),]) * distinct_values_max);
-
- Q = matrix (0, rows = 2, cols = 1);
- Q[1,1] = cur_node_small;
- Q[2,1] = cur_tree_small;
- F = matrix (0, rows = 3, cols = 1);
- NC = matrix (0, rows = 5, cols = 1);
- S = matrix (0, rows = 1, cols = 1);
- cur_nodes_ = matrix (cur_node_small, rows = 2, cols = 1);
- cur_nodes_[1,1] = cur_node_small;
- cur_nodes_[2,1] = cur_tree_small;
- num_cur_nodes = 1;
- level_ = level;
-
- while (num_cur_nodes > 0 & level_ < depth) {
- level_ = level_ + 1;
- cur_Q = matrix (0, rows = 2, cols = 2 * num_cur_nodes);
- cur_F = matrix (0, rows = 3, cols = num_cur_nodes);
- cur_NC = matrix (0, rows = 5, cols = 2 * num_cur_nodes);
- cur_S = matrix (0, rows = 1, cols = num_cur_nodes *
distinct_values_max);
-
- parfor (i8 in 1:num_cur_nodes, check = 0) {
- cur_node = as.scalar (cur_nodes_[1,i8]);
- cur_tree = as.scalar (cur_nodes_[2,i8]);
-
- # select sample features WOR
- feature_samples = sample (num_features_total, num_feature_samples);
- feature_samples = order (target = feature_samples, by = 1);
- num_scale_feature_samples = sum (feature_samples <=
num_scale_features);
- num_cat_feature_samples = num_feature_samples -
num_scale_feature_samples;
-
- # --- find best split ---
- # samples that reach cur_node
- Ix = (L_small[,cur_tree] == cur_node);
- cur_Y_bin = Y_bin_small * (Ix * C_small[,cur_tree]);
- label_counts_overall = colSums (cur_Y_bin);
-
- label_sum_overall = sum (label_counts_overall);
- label_dist_overall = label_counts_overall / label_sum_overall;
- if (imp == "entropy") {
- label_dist_zero = (label_dist_overall == 0);
- cur_impurity = - sum (label_dist_overall * log (label_dist_overall
+ label_dist_zero)); # / log (2);
- } else { # imp == "Gini"
- cur_impurity = sum (label_dist_overall * (1 - label_dist_overall));
- }
- best_scale_gain = 0;
- best_cat_gain = 0;
-
- if (num_scale_features > 0 & num_scale_feature_samples > 0) {
- scale_feature_samples =
feature_samples[1:num_scale_feature_samples,];
-
- # main operation
- label_counts_left_scale = t (t (cur_Y_bin) %*% X_scale_ext_small);
-
- # compute left and right label distribution
- label_sum_left = rowSums (label_counts_left_scale);
- label_dist_left = label_counts_left_scale / label_sum_left;
- if (imp == "entropy") {
- label_dist_left = replace (target = label_dist_left, pattern =
0, replacement = 1);
- log_label_dist_left = log (label_dist_left); # / log (2)
- impurity_left_scale = - rowSums (label_dist_left *
log_label_dist_left);
- } else { # imp == "Gini"
- impurity_left_scale = rowSums (label_dist_left * (1 -
label_dist_left));
- }
- #
- label_counts_right_scale = - label_counts_left_scale +
label_counts_overall;
- label_sum_right = rowSums (label_counts_right_scale);
- label_dist_right = label_counts_right_scale / label_sum_right;
- if (imp == "entropy") {
- label_dist_right = replace (target = label_dist_right, pattern =
0, replacement = 1);
- log_label_dist_right = log (label_dist_right); # log (2)
- impurity_right_scale = - rowSums (label_dist_right *
log_label_dist_right);
- } else { # imp == "Gini"
- impurity_right_scale = rowSums (label_dist_right * (1 -
label_dist_right));
- }
- I_gain_scale = cur_impurity - ( ( label_sum_left /
label_sum_overall ) * impurity_left_scale + ( label_sum_right /
label_sum_overall ) * impurity_right_scale);
-
- I_gain_scale = replace (target = I_gain_scale, pattern = NaN,
replacement = 0);
-
- # determine best feature to split on and the split value
- feature_start_ind = matrix (0, rows = 1, cols =
num_scale_features);
- feature_start_ind[1,1] = 1;
- if (num_scale_features > 1) {
- feature_start_ind[1,2:num_scale_features] =
cum_count_thresholds[1,1:(num_scale_features - 1)] + 1;
- }
- max_I_gain_found = 0;
- max_I_gain_found_ind = 0;
- best_i = 0;
-
- for (i in 1:num_scale_feature_samples) { # assuming
feature_samples is 5x1
- cur_feature_samples_bin = as.scalar (scale_feature_samples[i,]);
- cur_start_ind = as.scalar
(feature_start_ind[,cur_feature_samples_bin]);
- cur_end_ind = as.scalar
(cum_count_thresholds[,cur_feature_samples_bin]);
- I_gain_portion = I_gain_scale[cur_start_ind:cur_end_ind,];
- cur_max_I_gain = max (I_gain_portion);
- cur_max_I_gain_ind = as.scalar (rowIndexMax (t
(I_gain_portion)));
- if (cur_max_I_gain > max_I_gain_found) {
- max_I_gain_found = cur_max_I_gain;
- max_I_gain_found_ind = cur_max_I_gain_ind;
- best_i = i;
- }
- }
-
- best_scale_gain = max_I_gain_found;
- max_I_gain_ind_scale = max_I_gain_found_ind;
- best_scale_feature = 0;
- if (best_i > 0) {
- best_scale_feature = as.scalar (scale_feature_samples[best_i,]);
- }
- best_scale_split = max_I_gain_ind_scale;
- if (best_scale_feature > 1) {
- best_scale_split = best_scale_split +
as.scalar(cum_count_thresholds[,(best_scale_feature - 1)]);
- }
- }
+ if( verbose )
+ print("-- ["+i+"] sampled "+nrow(Xi)+"/"+nrow(X)+" rows and "+ncol(Xi)+"/"+ncol(X)+" cols.");
- if (num_cat_features > 0 & num_cat_feature_samples > 0){
- cat_feature_samples = feature_samples[(num_scale_feature_samples +
1):(num_scale_feature_samples + num_cat_feature_samples),] - num_scale_features;
-
- # initialization
- split_values_bin = matrix (0, rows = 1, cols =
distinct_values_overall);
- split_values = split_values_bin;
- split_values_offset = matrix (0, rows = 1, cols =
num_cat_features);
- I_gains = split_values_offset;
- impurities_left = split_values_offset;
- impurities_right = split_values_offset;
- best_label_counts_left = matrix (0, rows = num_cat_features, cols
= num_classes);
- best_label_counts_right = matrix (0, rows = num_cat_features, cols
= num_classes);
-
- # main operation
- label_counts = t (t (cur_Y_bin) %*% X_cat_small);
-
- parfor (i9 in 1:num_cat_feature_samples, check = 0){
- cur_cat_feature = as.scalar (cat_feature_samples[i9,1]);
- start_ind = 1;
- if (cur_cat_feature > 1)
- start_ind = start_ind + as.scalar
(distinct_values_offset[(cur_cat_feature - 1),]);
- offset = as.scalar (distinct_values[1,cur_cat_feature]);
- cur_label_counts = label_counts[start_ind:(start_ind + offset -
1),];
- label_sum = rowSums (cur_label_counts);
- label_dist = cur_label_counts / label_sum;
- if (imp == "entropy") {
- label_dist = replace (target = label_dist, pattern = 0,
replacement = 1);
- log_label_dist = log (label_dist); # / log(2)
- impurity_tmp = - rowSums (label_dist * log_label_dist);
- impurity_tmp = replace (target = impurity_tmp, pattern = NaN,
replacement = 1/0);
- } else { # imp == "Gini"
- impurity_tmp = rowSums (label_dist * (1 - label_dist));
- }
-
- # sort cur feature by impurity
- cur_distinct_values = seq (1, nrow (cur_label_counts));
- cur_distinct_values_impurity = cbind (cur_distinct_values,
impurity_tmp);
- cur_feature_sorted = order (target =
cur_distinct_values_impurity, by = 2, decreasing = FALSE);
- P = table (cur_distinct_values, cur_feature_sorted); #
permutation matrix
- label_counts_sorted = P %*% cur_label_counts;
-
- # compute left and right label distribution
- label_counts_left = cumsum (label_counts_sorted);
-
- label_sum_left = rowSums (label_counts_left);
- label_dist_left = label_counts_left / label_sum_left;
- label_dist_left = replace (target = label_dist_left, pattern =
NaN, replacement = 1);
- if (imp == "entropy") {
- label_dist_left = replace (target = label_dist_left, pattern =
0, replacement = 1);
- log_label_dist_left = log (label_dist_left); # / log(2)
- impurity_left = - rowSums (label_dist_left *
log_label_dist_left);
- } else { # imp == "Gini"
- impurity_left = rowSums (label_dist_left * (1 -
label_dist_left));
- }
- #
- label_counts_right = - label_counts_left + label_counts_overall;
- label_sum_right = rowSums (label_counts_right);
- label_dist_right = label_counts_right / label_sum_right;
- label_dist_right = replace (target = label_dist_right, pattern =
NaN, replacement = 1);
- if (imp == "entropy") {
- label_dist_right = replace (target = label_dist_right, pattern
= 0, replacement = 1);
- log_label_dist_right = log (label_dist_right); # / log (2)
- impurity_right = - rowSums (label_dist_right *
log_label_dist_right);
- } else { # imp == "Gini"
- impurity_right = rowSums (label_dist_right * (1 -
label_dist_right));
- }
- I_gain = cur_impurity - ( ( label_sum_left / label_sum_overall )
* impurity_left + ( label_sum_right / label_sum_overall ) * impurity_right);
-
- Ix_label_sum_left_zero = (label_sum_left == 0);
- Ix_label_sum_right_zero = (label_sum_right == 0);
- Ix_label_sum_zero = Ix_label_sum_left_zero *
Ix_label_sum_right_zero;
- I_gain = I_gain * (1 - Ix_label_sum_zero);
-
- I_gain[nrow (I_gain),] = 0; # last entry invalid
- max_I_gain_ind = as.scalar (rowIndexMax (t (I_gain)));
-
- split_values[1, start_ind:(start_ind + max_I_gain_ind - 1)] = t
(cur_feature_sorted[1:max_I_gain_ind,1]);
- for (i10 in 1:max_I_gain_ind) {
- ind = as.scalar (cur_feature_sorted[i10,1]);
- if (ind == 1)
- split_values_bin[1,start_ind] = 1.0;
- else
- split_values_bin[1,(start_ind + ind - 1)] = 1.0;
- }
- split_values_offset[1,cur_cat_feature] = max_I_gain_ind;
-
- I_gains[1,cur_cat_feature] = max (I_gain);
-
- impurities_left[1,cur_cat_feature] = as.scalar
(impurity_left[max_I_gain_ind,]);
- impurities_right[1,cur_cat_feature] = as.scalar
(impurity_right[max_I_gain_ind,]);
- best_label_counts_left[cur_cat_feature,] =
label_counts_left[max_I_gain_ind,];
- best_label_counts_right[cur_cat_feature,] =
label_counts_right[max_I_gain_ind,];
- }
-
- # determine best feature to split on and the split values
- best_cat_feature = as.scalar (rowIndexMax (I_gains));
- best_cat_gain = max (I_gains);
- start_ind = 1;
- if (best_cat_feature > 1) {
- start_ind = start_ind + as.scalar
(distinct_values_offset[(best_cat_feature - 1),]);
- }
- offset = as.scalar (distinct_values[1,best_cat_feature]);
- best_split_values_bin = split_values_bin[1, start_ind:(start_ind +
offset - 1)];
- }
-
- # compare best scale feature to best cat. feature and pick the best
one
- if (num_scale_features > 0 & num_scale_feature_samples > 0 &
best_scale_gain >= best_cat_gain & best_scale_gain > 0) {
- # --- update model ---
- cur_F[1,i8] = best_scale_feature;
- cur_F[2,i8] = 1;
- cur_F[3,i8] = 1;
- cur_S[1,(i8 - 1) * distinct_values_max + 1] =
thresholds[max_I_gain_ind_scale, best_scale_feature];
-
- left_child = 2 * (cur_node - 1) + 1 + 1;
- right_child = 2 * (cur_node - 1) + 2 + 1;
-
- # samples going to the left subtree
- Ix_left = X_scale_ext_small[, best_scale_split];
- Ix_left = Ix * Ix_left;
- Ix_right = Ix * (1 - Ix_left);
- L_small[,cur_tree] = L_small[,cur_tree] * (1 - Ix_left) + (Ix_left
* left_child);
- L_small[,cur_tree] = L_small[,cur_tree] * (1 - Ix_right) +
(Ix_right * right_child);
- left_child_size = sum (Ix_left * C_small[,cur_tree]);
- right_child_size = sum (Ix_right * C_small[,cur_tree]);
-
- # check if left or right child is a leaf
- left_pure = FALSE;
- right_pure = FALSE;
- cur_impurity_left =
as.scalar(impurity_left_scale[best_scale_split,]);
- cur_impurity_right =
as.scalar(impurity_right_scale[best_scale_split,]);
- if ( (left_child_size <= num_leaf | cur_impurity_left == 0 |
level_ == depth) &
- (right_child_size <= num_leaf | cur_impurity_right == 0 |
level_ == depth) ) { # both left and right nodes are leaf
-
- cur_label_counts_left =
label_counts_left_scale[best_scale_split,];
- cur_NC[1,(2 * (i8 - 1) + 1)] = left_child;
- cur_NC[2,(2 * (i8 - 1) + 1)] = cur_tree;
- cur_NC[3,(2 * (i8 - 1) + 1)] = as.scalar( rowIndexMax
(cur_label_counts_left)); # leaf class label
- left_pure = TRUE;
- # compute number of misclassified points
- cur_NC[4,(2 * (i8 - 1) + 1)] = left_child_size - max
(cur_label_counts_left);
-
- cur_label_counts_right = label_counts_overall -
cur_label_counts_left;
- cur_NC[1,(2 * i8)] = right_child;
- cur_NC[2,(2 * i8)] = cur_tree;
- cur_NC[3,(2 * i8)] = as.scalar( rowIndexMax
(cur_label_counts_right)); # leaf class label
- right_pure = TRUE;
- # compute number of misclassified points
- cur_NC[4,(2 * i8)] = right_child_size - max
(cur_label_counts_right);
-
- } else if (left_child_size <= num_leaf | cur_impurity_left == 0 |
level_ == depth) {
-
- cur_label_counts_left =
label_counts_left_scale[best_scale_split,];
- cur_NC[1,(2 * (i8 - 1) + 1)] = left_child;
- cur_NC[2,(2 * (i8 - 1) + 1)] = cur_tree;
- cur_NC[3,(2 * (i8 - 1) + 1)] = as.scalar( rowIndexMax
(cur_label_counts_left)); # leaf class label
- left_pure = TRUE;
- # compute number of misclassified points
- cur_NC[4,(2 * (i8 - 1) + 1)] = left_child_size - max
(cur_label_counts_left);
-
- } else if (right_child_size <= num_leaf | cur_impurity_right == 0
| level_ == depth) {
-
- cur_label_counts_right =
label_counts_right_scale[best_scale_split,];
- cur_NC[1,(2 * i8)] = right_child;
- cur_NC[2,(2 * i8)] = cur_tree;
- cur_NC[3,(2 * i8)] = as.scalar( rowIndexMax
(cur_label_counts_right)); # leaf class label
- right_pure = TRUE;
- # compute number of misclassified points
- cur_NC[4,(2 * i8)] = right_child_size - max
(cur_label_counts_right);
- }
- } else if (num_cat_features > 0 & num_cat_feature_samples > 0 &
best_cat_gain > 0) {
-
- # --- update model ---
- cur_F[1,i8] = best_cat_feature;
- cur_F[2,i8] = 2;
- offset_nonzero = as.scalar
(split_values_offset[1,best_cat_feature]);
- S_start_ind = (i8 - 1) * distinct_values_max + 1;
- cur_F[3,i8] = offset_nonzero;
- cur_S[1,S_start_ind:(S_start_ind + offset_nonzero - 1)] =
split_values[1,start_ind:(start_ind + offset_nonzero - 1)];
-
- left_child = 2 * (cur_node - 1) + 1 + 1;
- right_child = 2 * (cur_node - 1) + 2 + 1;
-
- # samples going to the left subtree
- Ix_left = rowSums (X_cat_small[,start_ind:(start_ind + offset -
1)] * best_split_values_bin);
- Ix_left = (Ix_left >= 1);
- Ix_left = Ix * Ix_left;
- Ix_right = Ix * (1 - Ix_left);
- L_small[,cur_tree] = L_small[,cur_tree] * (1 - Ix_left) + (Ix_left
* left_child);
- L_small[,cur_tree] = L_small[,cur_tree] * (1 - Ix_right) +
(Ix_right * right_child);
- left_child_size = sum (Ix_left * C_small[,cur_tree]);
- right_child_size = sum (Ix_right * C_small[,cur_tree]);
-
- # check if left or right child is a leaf
- left_pure = FALSE;
- right_pure = FALSE;
- cur_impurity_left = as.scalar(impurities_left[,best_cat_feature]);
- cur_impurity_right =
as.scalar(impurities_right[,best_cat_feature]);
- if ( (left_child_size <= num_leaf | cur_impurity_left == 0 |
level_ == depth) &
- (right_child_size <= num_leaf | cur_impurity_right == 0 |
level_ == depth) ) { # both left and right nodes are leaf
-
- cur_label_counts_left =
best_label_counts_left[best_cat_feature,];
- cur_NC[1,(2 * (i8 - 1) + 1)] = left_child;
- cur_NC[2,(2 * (i8 - 1) + 1)] = cur_tree;
- cur_NC[3,(2 * (i8 - 1) + 1)] = as.scalar( rowIndexMax
(cur_label_counts_left)); # leaf class label
- left_pure = TRUE;
- # compute number of misclassified points
- cur_NC[4,(2 * (i8 - 1) + 1)] = left_child_size - max
(cur_label_counts_left);
-
- cur_label_counts_right = label_counts_overall -
cur_label_counts_left;
- cur_NC[1,(2 * i8)] = right_child;
- cur_NC[2,(2 * i8)] = cur_tree;
- cur_NC[3,(2 * i8)] = as.scalar( rowIndexMax
(cur_label_counts_right)); # leaf class label
- right_pure = TRUE;
- # compute number of misclassified points
- cur_NC[4,(2 * i8)] = right_child_size - max
(cur_label_counts_right);
-
- } else if (left_child_size <= num_leaf | cur_impurity_left == 0 |
level_ == depth) {
-
- cur_label_counts_left =
best_label_counts_left[best_cat_feature,];
- cur_NC[1,(2 * (i8 - 1) + 1)] = left_child;
- cur_NC[2,(2 * (i8 - 1) + 1)] = cur_tree;
- cur_NC[3,(2 * (i8 - 1) + 1)] = as.scalar( rowIndexMax
(cur_label_counts_left)); # leaf class label
- left_pure = TRUE;
- # compute number of misclassified points
- cur_NC[4,(2 * (i8 - 1) + 1)] = left_child_size - max
(cur_label_counts_left);
-
- } else if (right_child_size <= num_leaf | cur_impurity_right == 0
| level_ == depth) {
- cur_label_counts_right =
best_label_counts_right[best_cat_feature,];
- cur_NC[1,(2 * i8)] = right_child;
- cur_NC[2,(2 * i8)] = cur_tree;
- cur_NC[3,(2 * i8)] = as.scalar( rowIndexMax
(cur_label_counts_right)); # leaf class label
- right_pure = TRUE;
- # compute number of misclassified points
- cur_NC[4,(2 * i8)] = right_child_size - max
(cur_label_counts_right);
- }
- } else {
- print ("NUMBER OF SAMPLES AT NODE " + cur_node + " in tree " +
cur_tree + " CANNOT BE REDUCED TO MATCH " + num_leaf + ". THIS NODE IS DECLARED
AS LEAF!");
- right_pure = TRUE;
- left_pure = TRUE;
- cur_NC[1,(2 * (i8 - 1) + 1)] = cur_node;
- cur_NC[2,(2 * (i8 - 1) + 1)] = cur_tree;
- class_label = as.scalar (rowIndexMax (label_counts_overall));
- cur_NC[3,(2 * (i8 - 1) + 1)] = class_label;
- cur_NC[4,(2 * (i8 - 1) + 1)] = label_sum_overall - max
(label_counts_overall);
- cur_NC[5,(2 * (i8 - 1) + 1)] = 1; # special leaf
- }
-
- # add nodes to Q
- if (!left_pure)
- cur_Q[,(2 * (i8 - 1)+ 1)] = as.matrix(list(left_child,cur_tree));
- if (!right_pure)
- cur_Q[,(2 * i8)] = as.matrix(list(right_child, cur_tree));
- }
-
- cur_Q = removeEmpty (target = cur_Q, margin = "cols");
- Q = cbind (Q, cur_Q);
- NC = cbind (NC, cur_NC);
- F = cbind (F, cur_F);
- S = cbind (S, cur_S);
-
- num_cur_nodes_pre = 2 * num_cur_nodes;
- if (as.scalar (cur_Q[1,1]) == 0) {
- num_cur_nodes = 0;
- } else {
- cur_nodes_ = cur_Q;
- num_cur_nodes = ncol (cur_Q);
- }
- }
-
- cur_Q_small[,start_ind_global:(start_ind_global + ncol (Q) - 1)] = Q;
- cur_NC_small[,start_ind_global:(start_ind_global + ncol (NC) - 1)] = NC;
- cur_F_small[,start_ind_global:(start_ind_global + ncol (F) - 1)] = F;
- cur_S_small[,start_ind_S_global:(start_ind_S_global + ncol (S) - 1)] = S;
- }
-
- ##### PREPARE MODEL FOR SMALL NODES
- if (num_cur_nodes_small > 0) { # small nodes already processed
- cur_Q_small = removeEmpty (target = cur_Q_small, margin = "cols");
- if (as.scalar (cur_Q_small[1,1]) != 0) Q_small = cbind (Q_small,
cur_Q_small);
- cur_NC_small = removeEmpty (target = cur_NC_small, margin = "cols");
- if (as.scalar (cur_NC_small[1,1]) != 0) NC_small = cbind (NC_small,
cur_NC_small);
-
- cur_F_small = removeEmpty (target = cur_F_small, margin = "cols"); #
- if (as.scalar (cur_F_small[1,1]) != 0) F_small = cbind (F_small,
cur_F_small);
- cur_S_small = removeEmpty (target = cur_S_small, margin = "cols"); #
- if (as.scalar (cur_S_small[1,1]) != 0) S_small = cbind (S_small,
cur_S_small);
-
- num_cur_nodes_small = 0; # reset
- }
- print (" --- end level " + level + ", remaining no. of LARGE nodes to
expand " + num_cur_nodes_large + " --- ");
+ # step 3: train decision tree
+ t2 = time();
+ si3 = as.integer(as.scalar(randSeeds[3*(i-1)+3,1]));
+ Mtemp = decisionTree(X=Xi, Y=yi, R=ctypes, depth=max_depth);
+ # TODO add min_leaf=min_leaf, max_features = max_features, impurity=impurity, seed=si3, verbose=verbose);
+ M[i,1:length(Mtemp)] = matrix(Mtemp, rows=1, cols=length(Mtemp));
+ if( verbose )
+ print("-- ["+i+"] trained decision tree in "+(time()-t2)/1e9+" seconds.");
}
-
- #### prepare model
- print ("PREPARING MODEL...")
- ### large nodes
- if (as.scalar (Q_large[1,1]) == 0 & ncol (Q_large) > 1)
- Q_large = Q_large[,2:ncol (Q_large)];
- if (as.scalar (NC_large[1,1]) == 0 & ncol (NC_large) > 1)
- NC_large = NC_large[,2:ncol (NC_large)];
- if (as.scalar (S_large[1,1]) == 0 & ncol (S_large) > 1)
- S_large = S_large[,2:ncol (S_large)];
- if (as.scalar (F_large[1,1]) == 0 & ncol (F_large) > 1)
- F_large = F_large[,2:ncol (F_large)];
+ M = cbind(F, M);
- ### small nodes
- if (as.scalar (Q_small[1,1]) == 0 & ncol (Q_small) > 1)
- Q_small = Q_small[,2:ncol (Q_small)];
- if (as.scalar (NC_small[1,1]) == 0 & ncol (NC_small) > 1)
- NC_small = NC_small[,2:ncol (NC_small)];
- if (as.scalar (S_small[1,1]) == 0 & ncol (S_small) > 1)
- S_small = S_small[,2:ncol (S_small)];
- if (as.scalar (F_small[1,1]) == 0 & ncol (F_small) > 1)
- F_small = F_small[,2:ncol (F_small)];
-
- # check for special leaves and if there are any remove them from Q_large and
Q_small
- special_large_leaves_ind = NC_large[5,];
- num_special_large_leaf = sum (special_large_leaves_ind);
- if (num_special_large_leaf > 0) {
- print ("PROCESSING " + num_special_large_leaf + " SPECIAL LARGE
LEAVES...");
- special_large_leaves = removeEmpty (target = NC_large[1:2,] *
special_large_leaves_ind, margin = "cols");
- large_internal_ind = 1 - colSums (outer (t (special_large_leaves[1,]),
Q_large[1,], "==") * outer (t (special_large_leaves[2,]), Q_large[2,], "=="));
- Q_large = removeEmpty (target = Q_large * large_internal_ind, margin =
"cols");
- F_large = removeEmpty (target = F_large, margin = "cols"); # remove
special leaves from F
- }
-
- special_small_leaves_ind = NC_small[5,];
- num_special_small_leaf = sum (special_small_leaves_ind);
- if (num_special_small_leaf > 0) {
- print ("PROCESSING " + num_special_small_leaf + " SPECIAL SMALL
LEAVES...");
- special_small_leaves = removeEmpty (target = NC_small[1:2,] *
special_small_leaves_ind, margin = "cols");
- small_internal_ind = 1 - colSums (outer (t (special_small_leaves[1,]),
Q_small[1,], "==") * outer (t (special_small_leaves[2,]), Q_small[2,], "=="));
- Q_small = removeEmpty (target = Q_small * small_internal_ind, margin =
"cols");
- F_small = removeEmpty (target = F_small, margin = "cols"); # remove
special leaves from F
- }
-
- # model corresponding to large internal nodes
- no_large_internal_node = FALSE;
- if (as.scalar (Q_large[1,1]) != 0) {
- print ("PROCESSING LARGE INTERNAL NODES...");
- num_large_internal = ncol (Q_large);
- max_offset = max (max (F_large[3,]), max (F_small[3,]));
- M1_large = matrix (0, rows = 6 + max_offset, cols = num_large_internal);
- M1_large[1:2,] = Q_large;
- M1_large[4:6,] = F_large;
- # process S_large
- cum_offsets_large = cumsum (t (F_large[3,]));
- parfor (it in 1:num_large_internal, check = 0) {
- start_ind = 1;
- if (it > 1)
- start_ind = start_ind + as.scalar (cum_offsets_large[(it - 1),]);
- offset = as.scalar (F_large[3,it]);
- M1_large[7:(7 + offset - 1),it] = t (S_large[1,start_ind:(start_ind + offset - 1)]);
- }
- } else {
- print ("No LARGE internal nodes available");
- no_large_internal_node = TRUE;
- }
-
- # model corresponding to small internal nodes
- no_small_internal_node = FALSE;
- if (as.scalar (Q_small[1,1]) != 0) {
- print ("PROCESSING SMALL INTERNAL NODES...");
- num_small_internal = ncol (Q_small);
- M1_small = matrix (0, rows = 6 + max_offset, cols = num_small_internal);
- M1_small[1:2,] = Q_small;
- M1_small[4:6,] = F_small;
- # process S_small
- cum_offsets_small = cumsum (t (F_small[3,]));
- parfor (it in 1:num_small_internal, check = 0) {
- start_ind = 1;
- if (it > 1)
- start_ind = start_ind + as.scalar (cum_offsets_small[(it - 1),]);
- offset = as.scalar (F_small[3,it]);
- M1_small[7:(7 + offset - 1),it] = t (S_small[1,start_ind:(start_ind + offset - 1)]);
- }
- } else {
- print ("No SMALL internal nodes available");
- no_small_internal_node = TRUE;
- }
-
- # model corresponding to large leaf nodes
- no_large_leaf_node = FALSE;
- if (as.scalar (NC_large[1,1]) != 0) {
- print ("PROCESSING LARGE LEAF NODES...");
- num_large_leaf = ncol (NC_large);
- M2_large = matrix (0, rows = 6 + max_offset, cols = num_large_leaf);
- M2_large[1:2,] = NC_large[1:2,];
- M2_large[5:7,] = NC_large[3:5,];
- } else {
- print ("No LARGE leaf nodes available");
- no_large_leaf_node = TRUE;
- }
-
- # model corresponding to small leaf nodes
- no_small_leaf_node = FALSE;
- if (as.scalar (NC_small[1,1]) != 0) {
- print ("PROCESSING SMALL LEAF NODES...");
- num_small_leaf = ncol (NC_small);
- M2_small = matrix (0, rows = 6 + max_offset, cols = num_small_leaf);
- M2_small[1:2,] = NC_small[1:2,];
- M2_small[5:7,] = NC_small[3:5,];
- } else {
- print ("No SMALL leaf nodes available");
- no_small_leaf_node = TRUE;
- }
-
- if (no_large_internal_node)
- M1 = M1_small;
- else if (no_small_internal_node)
- M1 = M1_large;
- else
- M1 = cbind (M1_large, M1_small);
-
- if (no_large_leaf_node)
- M2 = M2_small;
- else if (no_small_leaf_node)
- M2 = M2_large;
- else
- M2 = cbind (M2_large, M2_small);
-
- M = cbind (M1, M2);
- M = t (order (target = t (M), by = 1)); # sort by node id
- M = t (order (target = t (M), by = 2)); # sort by tree id
-
- # removing redundant subtrees
- if (ncol (M) > 1) {
- print ("CHECKING FOR REDUNDANT SUBTREES...");
- red_leaf = TRUE;
- process_red_subtree = FALSE;
- invalid_node_ind = matrix (0, rows = 1, cols = ncol (M));
- while (red_leaf & ncol (M) > 1) {
- leaf_ind = (M[4,] == 0);
- labels = M[5,] * leaf_ind;
- tree_ids = M[2,];
- parent_ids = floor (M[1,] /2);
- cond1 = (labels[,1:(ncol (M) - 1)] == labels[,2:ncol (M)]); # sibling leaves with same label
- cond2 = (parent_ids[,1:(ncol (M) - 1)] == parent_ids[,2:ncol (M)]); # same parents
- cond3 = (tree_ids[,1:(ncol (M) - 1)] == tree_ids[,2:ncol (M)]); # same tree
- red_leaf_ind = cond1 * cond2 * cond3 * leaf_ind[,2:ncol (M)];
-
- if (sum (red_leaf_ind) > 0) { # if redundant subtrees exist
- red_leaf_ids = M[1:2,2:ncol (M)] * red_leaf_ind;
- red_leaf_ids_nonzero = removeEmpty (target = red_leaf_ids, margin = "cols");
- parfor (it in 1:ncol (red_leaf_ids_nonzero), check = 0){
- cur_right_leaf_id = as.scalar (red_leaf_ids_nonzero[1,it]);
- cur_parent_id = floor (cur_right_leaf_id / 2);
- cur_tree_id = as.scalar (red_leaf_ids_nonzero[2,it]);
- cur_right_leaf_pos = as.scalar (rowIndexMax ((M[1,] == cur_right_leaf_id) * (M[2,] == cur_tree_id)));
- cur_parent_pos = as.scalar(rowIndexMax ((M[1,] == cur_parent_id) * (M[2,] == cur_tree_id)));
- M[3:nrow (M), cur_parent_pos] = M[3:nrow (M), cur_right_leaf_pos];
- M[4,cur_right_leaf_pos] = -1;
- M[4,cur_right_leaf_pos - 1] = -1;
- invalid_node_ind[1,cur_right_leaf_pos] = 1;
- invalid_node_ind[1,cur_right_leaf_pos - 1] = 1;
- }
- process_red_subtree = TRUE;
- } else {
- red_leaf = FALSE;
- }
- }
-
- if (process_red_subtree) {
- print ("REMOVING REDUNDANT SUBTREES...");
- valid_node_ind = (invalid_node_ind == 0);
- M = removeEmpty (target = M * valid_node_ind, margin = "cols");
- }
- }
-
- internal_ind = (M[4,] > 0);
- internal_ids = M[1:2,] * internal_ind;
- internal_ids_nonzero = removeEmpty (target = internal_ids, margin = "cols");
- if (as.scalar (internal_ids_nonzero[1,1]) > 0) { # if internal nodes exist
- a1 = internal_ids_nonzero[1,];
- a2 = internal_ids_nonzero[1,] * 2;
- vcur_tree_id = internal_ids_nonzero[2,];
- pos_a1 = rowIndexMax( outer(t(a1), M[1,], "==") * outer(t(vcur_tree_id), M[2,], "==") );
- pos_a2 = rowIndexMax( outer(t(a2), M[1,], "==") * outer(t(vcur_tree_id), M[2,], "==") );
- M[3,] = t(table(pos_a1, 1, pos_a2 - pos_a1, ncol(M), 1));
- }
- else {
- print ("All trees in the random forest contain only one leaf!");
+ if(verbose) {
+ print("randomForest: trained ensemble with num_trees="+num_trees+" in
"+(time()-t1)/1e9+" seconds.");
}
}
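Note: taken together, the replacement above reduces tree construction to a plain ensemble loop: sample observations per tree, delegate to the existing decisionTree builtin, and store each serialized tree as one row of M (later prefixed with the feature-sample matrix F via cbind). A minimal DML sketch of that pattern follows; the 0.7 sampling fraction, the fixed row width, and the per-iteration seed are illustrative placeholders (the actual script derives three seeds per tree from randSeeds and also samples features):

  # sketch: train num_trees CART trees on row samples of (X, y)
  M = matrix(0, rows=num_trees, cols=1024); # placeholder max serialized-tree width
  parfor(i in 1:num_trees) {
    I = rand(rows=nrow(X), cols=1, seed=i) <= 0.7;       # hypothetical row sample
    Xi = removeEmpty(target=X, margin="rows", select=I); # sampled observations
    yi = removeEmpty(target=y, margin="rows", select=I); # matching labels
    Mtemp = decisionTree(X=Xi, Y=yi, R=ctypes, depth=max_depth);
    M[i,1:length(Mtemp)] = matrix(Mtemp, rows=1, cols=length(Mtemp));
  }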
diff --git a/scripts/builtin/slicefinder.dml b/scripts/builtin/slicefinder.dml
index 3005c8c764..0689280c12 100644
--- a/scripts/builtin/slicefinder.dml
+++ b/scripts/builtin/slicefinder.dml
@@ -27,8 +27,8 @@
#
# INPUT:
# ---------------------------------------------------------------------------------------
-# X Recoded dataset into Matrix
-# e Trained model
+# X Feature matrix in recoded/binned representation
+# e Error vector of trained model
# k Number of subsets required
# maxL maximum level L (conjunctions of L predicates), 0 unlimited
# minSup minimum support (min number of rows per slice)
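Note: with e clarified as the error vector of a trained model, a typical call looks like the following sketch; the lm/lmPredict model choice, the squared-error construction, and the three-output binding are illustrative assumptions, not part of this patch:

  B = lm(X=X, y=y, verbose=FALSE);      # any model works; we only need per-row errors
  yhat = lmPredict(X=X, B=B, ytest=y);
  e = (y - yhat)^2;                     # error vector aligned with the rows of X
  [TK, TKC, D] = slicefinder(X=X, e=e, k=4, maxL=2, minSup=32);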
diff --git a/src/test/java/org/apache/sysds/test/functions/builtin/part2/BuiltinRandomForestTest.java b/src/test/java/org/apache/sysds/test/functions/builtin/part2/BuiltinRandomForestTest.java
index 1432ca2231..08eae79dff 100644
--- a/src/test/java/org/apache/sysds/test/functions/builtin/part2/BuiltinRandomForestTest.java
+++ b/src/test/java/org/apache/sysds/test/functions/builtin/part2/BuiltinRandomForestTest.java
@@ -19,7 +19,6 @@
package org.apache.sysds.test.functions.builtin.part2;
-import org.junit.Ignore;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
@@ -51,34 +50,29 @@ public class BuiltinRandomForestTest extends AutomatedTestBase
@Parameterized.Parameter(1)
public int cols;
@Parameterized.Parameter(2)
- public int bins;
- @Parameterized.Parameter(3)
public int depth;
- @Parameterized.Parameter(4)
+ @Parameterized.Parameter(3)
public int num_leaf;
- @Parameterized.Parameter(5)
+ @Parameterized.Parameter(4)
public int num_trees;
- @Parameterized.Parameter(6)
+ @Parameterized.Parameter(5)
public String impurity;
@Parameterized.Parameters
public static Collection<Object[]> data() {
return Arrays.asList(new Object[][] {
- //TODO fix randomForest script (currently indexing issues)
- {2000, 7, 4, 7, 10, 1, "Gini"},
- {2000, 7, 4, 7, 10, 1, "entropy"},
- {2000, 7, 4, 3, 5, 3, "Gini"},
- {2000, 7, 4, 3, 5, 3, "entropy"},
+ {2000, 7, 7, 10, 1, "gini"},
+ {2000, 7, 7, 10, 1, "entropy"},
+ {1000, 7, 7, 10, 4, "gini"},
+ {1000, 7, 7, 10, 4, "entropy"},
});
}
- @Ignore
@Test
public void testRandomForestSinglenode() {
runRandomForestTest(ExecMode.SINGLE_NODE);
}
- @Ignore
@Test
public void testRandomForestHybrid() {
runRandomForestTest(ExecMode.HYBRID);
@@ -95,7 +89,7 @@ public class BuiltinRandomForestTest extends AutomatedTestBase
String HOME = SCRIPT_DIR + TEST_DIR;
fullDMLScriptName = HOME + TEST_NAME + ".dml";
programArgs = new String[]{"-args",
- input("X"), input("Y"), String.valueOf(bins),
+ input("X"), input("Y"),
String.valueOf(depth), String.valueOf(num_leaf),
String.valueOf(num_trees), impurity,
output("B") };
diff --git a/src/test/scripts/functions/builtin/randomForest.dml b/src/test/scripts/functions/builtin/randomForest.dml
index e7b2b7120b..092d52f8cb 100644
--- a/src/test/scripts/functions/builtin/randomForest.dml
+++ b/src/test/scripts/functions/builtin/randomForest.dml
@@ -19,16 +19,25 @@
#
#-------------------------------------------------------------
-X = read($1);
+F = as.frame(read($1));
Y = read($2);
-R = matrix(1, rows=ncol(X), cols = 3);
-bins = $3;
-depth = $4;
-num_leafs = $5;
-num_trees = $6;
-impurity = $7;
+depth = $3;
+num_leafs = $4;
+num_trees = $5;
+impurity = $6;
-[M, C, S_map, C_map] = randomForest(X=X, Y=Y, R=R, bins=bins, depth=depth,
- num_leaf=num_leafs, num_trees=num_trees, impurity=impurity);
+jspec = "{ids: true, bin: ["
+ + "{id: 1, method: equi-width, numbins: 10},"
+ + "{id: 2, method: equi-width, numbins: 10},"
+ + "{id: 3, method: equi-width, numbins: 10},"
+ + "{id: 4, method: equi-width, numbins: 10},"
+ + "{id: 5, method: equi-width, numbins: 10},"
+ + "{id: 6, method: equi-width, numbins: 10},"
+ + "{id: 7, method: equi-width, numbins: 10}]}";
+[X,D] = transformencode(target=F, spec=jspec);
-write(M, $8);
+R = matrix(1, rows=1, cols=ncol(X));
+M = randomForest(X=X, y=Y, ctypes=R, num_trees=num_trees, seed=7,
+ max_depth=depth, min_leaf=num_leafs, impurity=impurity, verbose=TRUE);
+
+write(M, $7);
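Note: after this change the test script takes seven positional arguments (X path, Y path, depth, num_leafs, num_trees, impurity, output path). A hedged example invocation matching the {1000, 7, 7, 10, 4, "gini"} parameter row above, with placeholder data paths:

  systemds src/test/scripts/functions/builtin/randomForest.dml \
    -args data/X.csv data/Y.csv 7 10 4 gini out/B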