[GitHub] [spark] srowen commented on a change in pull request #32813: [SPARK-34591][MLLIB][WIP] Add decision tree pruning as a parameter

GitBox Mon, 14 Jun 2021 10:53:37 -0700


srowen commented on a change in pull request #32813:
URL: https://github.com/apache/spark/pull/32813#discussion_r651156438




##########
File path: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala
##########
@@ -75,6 +75,23 @@ private[ml] trait DecisionTreeParams extends PredictorParams
     " discretizing continuous features.  Must be at least 2 and at least 
number of categories" +
     " for any categorical feature.", ParamValidators.gtEq(2))
 
+  /**
+   * If true, the trained tree will undergo a 'pruning' process after training 
in which nodes
+   * that have the same class predictions will be merged.  The benefit being 
that at prediction
+   * time the tree will be 'leaner'

Review comment:
       I think the key point for users is that this is a good thing _if_ only 
interested in class predictions. If interested in class probabilities, it won't 
necessarily give the right result and should be set to `false`. The text here 
is fine just wanting to make the tradeoffs explicit. I.e. `leaner` means 
smaller and faster.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala
##########
@@ -75,6 +75,23 @@ private[ml] trait DecisionTreeParams extends PredictorParams
     " discretizing continuous features.  Must be at least 2 and at least 
number of categories" +
     " for any categorical feature.", ParamValidators.gtEq(2))
 
+  /**
+   * If true, the trained tree will undergo a 'pruning' process after training 
in which nodes
+   * that have the same class predictions will be merged.  The benefit being 
that at prediction
+   * time the tree will be 'leaner'
+   * If false, the post-training tree will undergo no pruning.  The benefit 
being that you
+   * maintain the class prediction probabilities
+   * (default = false)

Review comment:
       I think as a first step we should keep it to 'true'.

##########
File path: 
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala
##########
@@ -27,7 +27,11 @@ import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.tree._
 import org.apache.spark.ml.util.TestingUtils._
 import org.apache.spark.mllib.tree.{DecisionTreeSuite => OldDTSuite, 
EnsembleTestHelper}
-import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, 
QuantileStrategy, Strategy => OldStrategy}
+import org.apache.spark.mllib.tree.configuration.{

Review comment:
       Undo the formatting changes here if you would - just makes it easier to 
understand and avoids code churn in the git log




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] srowen commented on a change in pull request #32813: [SPARK-34591][MLLIB][WIP] Add decision tree pruning as a parameter

Reply via email to