srowen commented on a change in pull request #32813:
URL: https://github.com/apache/spark/pull/32813#discussion_r651156438
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala
##########
@@ -75,6 +75,23 @@ private[ml] trait DecisionTreeParams extends PredictorParams
" discretizing continuous features. Must be at least 2 and at least
number of categories" +
" for any categorical feature.", ParamValidators.gtEq(2))
+ /**
+ * If true, the trained tree will undergo a 'pruning' process after training
in which nodes
+ * that have the same class predictions will be merged. The benefit being
that at prediction
+ * time the tree will be 'leaner'
Review comment:
I think the key point for users is that this is a good thing _if_ only
interested in class predictions. If interested in class probabilities, it won't
necessarily give the right result and should be set to `false`. The text here
is fine just wanting to make the tradeoffs explicit. I.e. `leaner` means
smaller and faster.
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala
##########
@@ -75,6 +75,23 @@ private[ml] trait DecisionTreeParams extends PredictorParams
" discretizing continuous features. Must be at least 2 and at least
number of categories" +
" for any categorical feature.", ParamValidators.gtEq(2))
+ /**
+ * If true, the trained tree will undergo a 'pruning' process after training
in which nodes
+ * that have the same class predictions will be merged. The benefit being
that at prediction
+ * time the tree will be 'leaner'
+ * If false, the post-training tree will undergo no pruning. The benefit
being that you
+ * maintain the class prediction probabilities
+ * (default = false)
Review comment:
I think as a first step we should keep it to 'true'.
##########
File path:
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala
##########
@@ -27,7 +27,11 @@ import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.tree._
import org.apache.spark.ml.util.TestingUtils._
import org.apache.spark.mllib.tree.{DecisionTreeSuite => OldDTSuite,
EnsembleTestHelper}
-import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo,
QuantileStrategy, Strategy => OldStrategy}
+import org.apache.spark.mllib.tree.configuration.{
Review comment:
Undo the formatting changes here if you would - just makes it easier to
understand and avoids code churn in the git log
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]