Alessandro Solimando created SPARK-23409:
--------------------------------------------
Summary: RandomForest/DecisionTree (syntactic) pruning of redundant subtrees
Key: SPARK-23409
URL: https://issues.apache.org/jira/browse/SPARK-23409
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 2.2.1
Environment:
Reporter: Alessandro Solimando

Improvement: eliminate redundancy from decision trees by collapsing every subtree whose leaves all share the same prediction.

Benefits:
* Model interpretability
* Faster unitary model invocation (relevant for massive )
* Smaller model memory footprint

For instance, consider the following decision tree.

{panel:title=Original Decision Tree}
{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
  If (feature 1 <= 0.5)
   If (feature 2 <= 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
   Else (feature 2 > 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
  Else (feature 1 > 0.5)
   If (feature 2 <= 0.5)
    If (feature 0 <= 0.5)
     Predict: 1.0
    Else (feature 0 > 0.5)
     Predict: 1.0
   Else (feature 2 > 0.5)
    If (feature 0 <= 0.5)
     Predict: 0.0
    Else (feature 0 > 0.5)
     Predict: 0.0
{noformat}
{panel}

Given the first tree as input, the proposed method aims to produce the following (semantically equivalent) tree:

{panel:title=Pruned Decision Tree}
{noformat}
DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5 nodes
  If (feature 1 <= 0.5)
   Predict: 0.0
  Else (feature 1 > 0.5)
   If (feature 2 <= 0.5)
    Predict: 1.0
   Else (feature 2 > 0.5)
    Predict: 0.0
{noformat}
{panel}
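The pruning described above can be illustrated with a minimal, standalone sketch. It does not use Spark's API or reflect how the change would be implemented in MLlib: the `Leaf` and `Split` classes and the `prune`, `count_nodes`, `depth`, and `predict` helpers are hypothetical names introduced only for this example. The assumption is a binary tree with `<=` / `>` threshold splits, pruned bottom-up by merging any split whose two (already pruned) children are leaves with the same prediction.

```python
from dataclasses import dataclass
from itertools import product
from typing import Sequence, Union


@dataclass(frozen=True)
class Leaf:
    """Hypothetical leaf node: always predicts a fixed value."""
    prediction: float


@dataclass(frozen=True)
class Split:
    """Hypothetical internal node: go left if features[feature] <= threshold."""
    feature: int
    threshold: float
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def prune(node: Node) -> Node:
    """Collapse, bottom-up, every subtree whose leaves all share one prediction."""
    if isinstance(node, Leaf):
        return node
    left = prune(node.left)
    right = prune(node.right)
    # After pruning the children, a redundant subtree shows up as two
    # leaves carrying the same prediction; replace the split by one leaf.
    if (isinstance(left, Leaf) and isinstance(right, Leaf)
            and left.prediction == right.prediction):
        return Leaf(left.prediction)
    return Split(node.feature, node.threshold, left, right)


def count_nodes(node: Node) -> int:
    if isinstance(node, Leaf):
        return 1
    return 1 + count_nodes(node.left) + count_nodes(node.right)


def depth(node: Node) -> int:
    if isinstance(node, Leaf):
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def predict(node: Node, features: Sequence[float]) -> float:
    while isinstance(node, Split):
        node = node.left if features[node.feature] <= node.threshold else node.right
    return node.prediction


def stump(pred_left: float, pred_right: float) -> Split:
    """Depth-1 subtree splitting on feature 0, as in the example tree."""
    return Split(0, 0.5, Leaf(pred_left), Leaf(pred_right))


# The original tree from the example: depth 3, 15 nodes.
original = Split(
    1, 0.5,
    Split(2, 0.5, stump(0.0, 0.0), stump(0.0, 0.0)),
    Split(2, 0.5, stump(1.0, 1.0), stump(0.0, 0.0)),
)
pruned = prune(original)

# Semantic equivalence check over all binary feature vectors.
equivalent = all(
    predict(original, x) == predict(pruned, x)
    for x in product([0.0, 1.0], repeat=3)
)
```

In this sketch `pruned` has the same shape as the pruned tree shown in the example: a split on feature 1 whose left child is a leaf and whose right child is a split on feature 2. Comparing predictions with `==` is a simplification; a real implementation would also have to decide what happens to other per-leaf information (for example impurity statistics or class probabilities) when leaves are merged.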