Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/2435#discussion_r18019000
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
@@ -128,13 +139,34 @@ private[tree] object DecisionTreeMetadata {
}
}
+    // Set number of features to use per node (for random forests).
+    val _featureSubsetStrategy = featureSubsetStrategy match {
+      case "auto" => if (numTrees == 1) "all" else "sqrt"
+      case _ => featureSubsetStrategy
+    }
+    val numFeaturesPerNode: Int = _featureSubsetStrategy match {
+      case "all" => numFeatures
+      case "sqrt" => math.sqrt(numFeatures).ceil.toInt
+      case "log2" => math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt)
--- End diff --
The `log2` option is from Breiman's paper:
http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf

From R's randomForest doc:
> Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
From http://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf
> this is the only parameter that requires some judgment to set, but forests isn't too sensitive to its value as long as it's in the right ball park. I have found that setting mtry equal to the square root of mdim gives generally near optimum results. My advice is to begin with this value and try a value twice as high and half as low monitoring the results by setting look=1 and checking the internal test set error for a small number of trees. With many noise variables present, mtry has to be set higher.
Let's set the default to `sqrt`, keep `log2` and `onethird`, and mention these references in the doc or code comments.
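The proposal above (resolve `auto` to `sqrt` for forests, keep `log2` and `onethird` as explicit options) could be sketched as a standalone helper. This mirrors the logic in the diff but is a hypothetical sketch, not Spark's actual `DecisionTreeMetadata` code; the `onethird` case and the error branch are assumptions about how the remaining options would look:

```scala
// Hypothetical helper: map a feature-subset strategy name to the
// number of features sampled per node in a random forest.
object FeatureSubsetSketch {
  def numFeaturesPerNode(
      featureSubsetStrategy: String,
      numTrees: Int,
      numFeatures: Int): Int = {
    // "auto": all features for a single tree, sqrt(numFeatures) for a
    // forest, per the default proposed in this review.
    val resolved = featureSubsetStrategy match {
      case "auto" => if (numTrees == 1) "all" else "sqrt"
      case other => other
    }
    resolved match {
      case "all" => numFeatures
      case "sqrt" => math.sqrt(numFeatures).ceil.toInt
      case "log2" =>
        math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt)
      // "onethird" matches R's regression default of p/3 (assumed rounding).
      case "onethird" => math.max(1, (numFeatures / 3.0).ceil.toInt)
      case _ =>
        throw new IllegalArgumentException(
          s"Unknown featureSubsetStrategy: $featureSubsetStrategy")
    }
  }
}
```

For example, `numFeaturesPerNode("auto", numTrees = 10, numFeatures = 50)` resolves to `sqrt` and samples ceil(sqrt(50)) = 8 features per node.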
---