vedelsbrunner commented on a change in pull request #1334:
URL: https://github.com/apache/systemds/pull/1334#discussion_r671325137



##########
File path: scripts/builtin/xgboost.dml
##########
@@ -0,0 +1,780 @@
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME          TYPE              DEFAULT     MEANING
+# ---------------------------------------------------------------------------------------------
+# X             Matrix[Double]    ---         Feature matrix X; note that X needs to be both recoded and dummy coded
+# Y             Matrix[Double]    ---         Label matrix Y; note that Y needs to be both recoded and dummy coded
+# R             Matrix[Double]    1, 1xn      Matrix R; 1xn vector which for each feature in X contains one of the following values:
+#                                               - R[,1]: 1 (scale feature)
+#                                               - R[,2]: 2 (categorical feature)
+#                                             i.e., in this example feature 1 is a scale feature and feature 2 is a categorical feature.
+#                                             If R is not provided, all features are assumed to be scale (1) by default.
+# sml_type      Integer           1           Supervised machine learning type: 1 = Regression (default), 2 = Classification
+# num_trees     Integer           7           Number of trees to be created in the xgboost model
+# learning_rate Double            0.3         Alias: eta. After each boosting step the learning rate controls the weights of the new predictions
+# max_depth     Integer           6           Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit
+# lambda        Double            0.0         L2 regularization term on weights. Increasing this value makes the model more conservative and reduces the number of leaves of a tree
+# ---------------------------------------------------------------------------------------------
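
As an aside, a minimal DML sketch (not part of this patch) of how R could be constructed under the 1 = scale / 2 = categorical encoding described above, with made-up data:

    X = matrix("4.2 1.0 3.1 2.0 5.0 1.0", rows=3, cols=2)  # 3 samples, 2 features
    R = matrix("1 2", rows=1, cols=2)                      # feature 1: scale, feature 2: categorical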
+
+# ---------------------------------------------------------------------------------------------
+# OUTPUT:
+# Matrix M where each column corresponds to a node in the learned tree (the first node is the init prediction) and each row contains the following information:
+#  M[1,j]: id of node j (in a complete binary tree)
+#  M[2,j]: tree id to which node j belongs
+#  M[3,j]: Offset (no. of columns) to the left child of j if j is an internal node, otherwise 0
+#  M[4,j]: Feature index of the feature that node j looks at if j is an internal node, otherwise 0
+#    (scale feature id if the feature is scale, or categorical feature id if the feature is categorical)
+#  M[5,j]: Type of the feature that node j looks at if j is an internal node: 0 = leaf, 1 = scale, 2 = categorical
+#  M[6:,j]: If j is an internal node: the threshold the example's feature value is compared to is stored at M[6,j] if the feature chosen for j is scale;
+#     otherwise, if the feature chosen for j is categorical, rows 6,7,... depict the value subset chosen for j.
+#     If j is a leaf node: 1 if j is impure and the number of samples at j > threshold, otherwise 0
+# ---------------------------------------------------------------------------------------------
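
For orientation, a hedged usage sketch in DML (illustrative only; it assumes the builtin is invoked by its script name, without the m_ prefix, per SystemDS convention, and uses made-up data):

    X = matrix("4.2 1.0 3.1 2.0 5.0 1.0", rows=3, cols=2)  # 3 samples, 2 features
    y = matrix("1.0 2.0 3.0", rows=3, cols=1)
    R = matrix("1 2", rows=1, cols=2)                      # feature 1: scale, feature 2: categorical
    M = xgboost(X=X, y=y, R=R, sml_type=1, num_trees=7, learning_rate=0.3, max_depth=6, lambda=0.0)
    print(as.scalar(M[6,1]))                               # init prediction (median of y)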
+
+m_xgboost = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] R = matrix(1,rows=1,cols=ncol(X)),
+  Integer sml_type = 1, Integer num_trees = 7, Double learning_rate = 0.3, Integer max_depth = 6, Double lambda = 0.0)
+  return (Matrix[Double] M) {
+  # test if the input is correct (R has one entry per feature, hence cols=ncol(X) above)
+  assert(nrow(X) == nrow(y))
+  assert(ncol(y) == 1)
+  assert(nrow(R) == 1)
+
+  M = matrix(0,rows=6,cols=0)
+  # set the init prediction at first col in M
+  init_prediction_matrix = matrix("0 0 0 0 0 0",rows=nrow(M),cols=1)
+  init_prediction_matrix[6,1] = median(y)
+  M = cbind(M, init_prediction_matrix)
+
+  current_prediction = matrix(median(y), rows=nrow(y), cols=1)
+
+  tree_id = 1
+  while(tree_id <= num_trees)
+  {
+    curr_M = buildOneTree(X, y, R, sml_type, max_depth, current_prediction, tree_id, lambda)
+
+    # the current prediction already includes all previous trees, so we only add the current tree to calculate the new predictions
+    current_prediction = calculateNewPredictions(X, sml_type, current_prediction, learning_rate, curr_M)
+
+    tree_id = tree_id + 1
+    M = cbind(M, curr_M) # concat the new tree to the existing ones (forest-ing)
+  }
+}
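
To make the additive update concrete, a hedged worked example for a single sample in the regression case (the per-tree outputs 1.0 and 0.5 are made up):

    init = 3.0             # median(y), stored in M[6,1]
    p1 = init + 0.3 * 1.0  # after tree 1: 3.3
    p2 = p1 + 0.3 * 0.5    # after tree 2: 3.45
    print(p2)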
+
+
+#-----------------------------------------------------------------------------------------------------------------------
+# INPUT:    X: nxm matrix, original input matrix
+# INPUT:    sml_type: supervised machine learning type, 1 = Regression, 2 = Classification
+# INPUT:    current_prediction: nx1 vector of the current predictions for the target y (in the 1st run this is the init prediction)
+# INPUT:    learning_rate: set by the user
+# INPUT:    curr_M: the current M matrix with the current tree
+# OUTPUT:   new_prediction: nx1 vector of new predictions for the target y
+calculateNewPredictions = function(Matrix[Double] X, Integer sml_type, Matrix[Double] current_prediction,
+    Double learning_rate, Matrix[Double] curr_M)
+    return (Matrix[Double] new_prediction) {
+  assert(ncol(current_prediction) == 1)
+  assert(nrow(current_prediction) == nrow(X)) # each entry should have its own initial prediction
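
The excerpt ends before the update itself; as a hedged sketch, the standard gradient-boosting rule such a function would apply in the regression case looks as follows (tree_output is a hypothetical stand-in for evaluating curr_M on X):

    X = matrix(1.0, rows=4, cols=2)
    current_prediction = matrix(3.0, rows=nrow(X), cols=1)
    learning_rate = 0.3
    tree_output = matrix(0.5, rows=nrow(X), cols=1)  # placeholder for per-sample tree predictions
    new_prediction = current_prediction + learning_rate * tree_output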

Review comment:
       done in 
https://github.com/apache/systemds/pull/1334/commits/59c408c4a4b676cdeb3fa87a100f08517380ab95




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@systemds.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

