vedelsbrunner commented on a change in pull request #1334:
URL: https://github.com/apache/systemds/pull/1334#discussion_r671325137
##########
File path: scripts/builtin/xgboost.dml
##########

@@ -0,0 +1,780 @@
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME           TYPE            DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X              Matrix[Double]  ---      Feature matrix X; note that X needs to be both recoded and dummy coded
+# Y              Matrix[Double]  ---      Label matrix Y; note that Y needs to be both recoded and dummy coded
+# R              Matrix[Double]  1, 1xn   Matrix R; 1xn vector which for each feature in X contains the following information:
+#                                         - R[,1]: 1 (scalar feature)
+#                                         - R[,2]: 2 (categorical feature)
+#                                         Here feature 1 is a scalar feature and feature 2 is a categorical feature.
+#                                         If R is not provided, by default all variables are assumed to be scale (1)
+# sml_type       Integer         1        Supervised machine learning type: 1 = Regression (default), 2 = Classification
+# num_trees      Integer         7        Number of trees to be created in the xgboost model
+# learning_rate  Double          0.3      Alias: eta. After each boosting step the learning rate controls the weights of the new predictions
+# max_depth      Integer         6        Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit
+# lambda         Double          0.0      L2 regularization term on weights. Increasing this value makes the model more conservative and reduces the number of leaves of a tree
+# ---------------------------------------------------------------------------------------------
+
+# ---------------------------------------------------------------------------------------------
+# OUTPUT:
+# Matrix M where each column corresponds to a node in the learned tree (the first node is the init prediction)
+# and each row contains the following information:
+# M[1,j]: id of node j (in a complete binary tree)
+# M[2,j]: tree id to which node j belongs
+# M[3,j]: Offset (no. of columns) to the left child of j if j is an internal node, otherwise 0
+# M[4,j]: Feature index of the feature (scale feature id if the feature is scale, or categorical feature id if the feature is categorical)
+#         that node j looks at if j is an internal node, otherwise 0
+# M[5,j]: Type of the feature that node j looks at if j is an internal node: 0 = leaf, 1 = scalar, 2 = categorical
+# M[6:,j]: If j is an internal node: the threshold the example's feature value is compared to is stored at M[6,j] if the feature chosen for j is scale;
+#          otherwise, if the feature chosen for j is categorical, rows 6,7,... depict the value subset chosen for j.
+#          If j is a leaf node: 1 if j is impure and the number of samples at j > threshold, otherwise 0
+# -------------------------------------------------------------------------------------------
+
+m_xgboost = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] R = matrix(1,rows=1,cols=ncol(X)),
+    Integer sml_type = 1, Integer num_trees = 7, Double learning_rate = 0.3, Integer max_depth = 6, Double lambda = 0.0)
+  return (Matrix[Double] M) {
+  # validate input
+  assert(nrow(X) == nrow(y))
+  assert(ncol(y) == 1)
+  assert(nrow(R) == 1)
+
+  M = matrix(0, rows=6, cols=0)
+  # set the init prediction as the first column of M
+  init_prediction_matrix = matrix("0 0 0 0 0 0", rows=nrow(M), cols=1)
+  init_prediction_matrix[6,1] = median(y)
+  M = cbind(M, init_prediction_matrix)
+
+  current_prediction = matrix(median(y), rows=nrow(y), cols=1)
+
+  tree_id = 1
+  while(tree_id <= num_trees)
+  {
+    curr_M = buildOneTree(X, y, R, sml_type, max_depth, current_prediction, tree_id, lambda)
+
+    # current_prediction already accounts for all previous trees, so only the
+    # newest tree is added when calculating the new predictions
+    current_prediction = calculateNewPredictions(X, sml_type, current_prediction, learning_rate, curr_M)
+
+    tree_id = tree_id + 1
+    M = cbind(M, curr_M) # concat the new tree to the existing ones (forest-ing)
+  }
+}
+
+#-----------------------------------------------------------------------------------------------------------------------
+# INPUT: X: n x m matrix, the original input matrix
+# INPUT: sml_type: supervised machine learning type: 1 = Regression, 2 = Classification
+# INPUT: current_prediction: nx1 vector of the current predictions for the target features y (in the first run this is the init prediction)
+# INPUT: learning_rate: set by the user
+# INPUT: curr_M: the current M matrix with the current tree
+# OUTPUT: new_prediction: nx1 vector of the new predictions for the target features y
+calculateNewPredictions = function(Matrix[Double] X, Integer sml_type, Matrix[Double] current_prediction,
+    Double learning_rate, Matrix[Double] curr_M)
+  return (Matrix[Double] new_prediction) {
+  assert(ncol(current_prediction) == 1)
+  assert(nrow(current_prediction) == nrow(X)) # each entry should have its own initial prediction

Review comment:
       done in https://github.com/apache/systemds/pull/1334/commits/59c408c4a4b676cdeb3fa87a100f08517380ab95


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@systemds.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
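Editor's note on the quoted `calculateNewPredictions` header: its contract (checked by the asserts above) and its regression-path update rule can be illustrated with a hypothetical Python analogue; `tree_outputs` stands in for the per-sample outputs decoded from `curr_M`, which is an assumption, not the script's actual API:

```python
def calculate_new_predictions_sketch(current_prediction, tree_outputs, learning_rate=0.3):
    # mirrors the DML asserts: one current prediction per row of X,
    # shaped as a single column (here a flat list of length n)
    assert len(current_prediction) == len(tree_outputs)
    # new prediction = old prediction + eta * output of the newest tree;
    # all previous trees are already folded into current_prediction
    return [p + learning_rate * t
            for p, t in zip(current_prediction, tree_outputs)]
```

The key design point this shows is that the function is incremental: it never re-evaluates earlier trees, only scales and adds the newest one.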