Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/21129#discussion_r187112582
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {
-  /* TODO: Add this doc when we add this param. SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
-   * then learning will stop early (before [[numIterations]]).
-   * This parameter is ignored when run is used.
+   * then learning will stop early (before [[maxIter]]).
+   * This parameter is ignored when fit without validation is used.
    * (default = 1e-5)
--- End diff --
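
For readers following along, here is a minimal sketch of the tolerance-based criterion the updated doc describes. It is illustrative only, not the actual GradientBoostedTrees implementation; `validationError` is a hypothetical hook that returns the validation error of the ensemble after a given number of boosting iterations.

```scala
// Illustrative sketch only (not Spark's implementation): stop when the change in
// validation error between consecutive iterations falls below validationTol.
// `validationError(i)` is a hypothetical hook giving the validation error of the
// ensemble after i boosting iterations.
object ToleranceStoppingSketch {
  def iterationsRun(maxIter: Int, validationTol: Double,
                    validationError: Int => Double): Int = {
    var iter = 1
    var prevError = validationError(1)
    var stop = false
    while (iter < maxIter && !stop) {
      val currError = validationError(iter + 1)
      if (math.abs(prevError - currError) < validationTol) {
        stop = true            // change smaller than the tolerance: stop before maxIter
      } else {
        prevError = currError
        iter += 1
      }
    }
    iter                       // number of boosting iterations actually performed
  }
}
```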
I forget why we chose 1e-5 (which is different from spark.mllib). What do
you think about using 0.01 to match the sklearn docs here?
http://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html
(I also checked xgboost, but they use a different approach based on a fixed number
of steps without improvement. We may want to add that at some point since it
sounds more robust.)
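
For comparison, a rough sketch of that xgboost-style criterion (stop after a fixed number of iterations without improvement). `patience` and `validationError` are illustrative names only, not an existing Spark or xgboost API.

```scala
// Rough sketch of patience-based early stopping (the "early_stopping_rounds" idea):
// stop once the validation error has not improved for `patience` consecutive iterations.
// `validationError(i)` is a hypothetical hook giving the validation error after i iterations.
object PatienceStoppingSketch {
  def bestIterations(maxIter: Int, patience: Int, validationError: Int => Double): Int = {
    var bestError = Double.MaxValue
    var bestIter = 0
    var iter = 1
    while (iter <= maxIter && iter - bestIter <= patience) {
      val currError = validationError(iter)
      if (currError < bestError) {
        bestError = currError    // new best: reset the "no improvement" window
        bestIter = iter
      }
      iter += 1
    }
    bestIter                     // iterations up to the best validation error
  }
}
```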
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]