Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/14547

@sethah I agree with you that the original TreeBoost does not use the loss to choose the structure of the tree; it only uses the loss to recompute example labels and to choose predicted values at leaf nodes. But as Vlad said, xgboost uses the loss to choose tree structure, which intuitively should help the GBM fit the data faster. Vlad's design allows testing vs. TreeBoost by setting impurity and loss separately, as well as testing vs. xgboost by setting impurity to be "loss-based."

* One question is whether we should change the default impurity to be "loss-based," which would change behavior to be closer to xgboost.

@vlad17 Test gists: I had a few questions about the gists you referenced in the PR description for comparing MLlib with R's gbm.

* `setMinInstancesPerNode(10)`: For MLlib, you set minInstancesPerNode=10. Is this the same value used by gbm by default? I'm trying to match up how the tests were run.
* At one point, you have the MLlib script output the value `counts.max / counts.sum`. I wasn't sure what the value was for. My guess was that it was a sanity check to verify that the train/test splits are identical across tests, but I don't see it output by the gbm script.
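For readers unfamiliar with the distinction being discussed: a "loss-based" impurity in the xgboost style scores candidate splits using the first- and second-order gradients of the training loss, rather than a generic impurity such as variance. The sketch below illustrates that split-gain formula; it is a minimal illustration of the general technique, not the implementation proposed in this PR, and `LossBasedGain`/`splitGain` are hypothetical names.

```scala
// Hypothetical sketch of an xgboost-style loss-based split gain.
// gL/hL and gR/hR are the sums of the loss's first- and second-order
// gradients over the examples falling to the left and right child;
// lambda is an L2 regularization term on the leaf values.
object LossBasedGain {
  def splitGain(gL: Double, hL: Double,
                gR: Double, hR: Double,
                lambda: Double = 1.0): Double = {
    // Structure score of a leaf with gradient sum g and hessian sum h.
    def score(g: Double, h: Double): Double = (g * g) / (h + lambda)
    // Gain = improvement of splitting vs. keeping the parent as a leaf.
    0.5 * (score(gL, hL) + score(gR, hR) - score(gL + gR, hL + hR))
  }
}
```

Under this scoring, a split that separates examples with opposing gradients gets a positive gain, while a split that leaves both children with the same gradient profile as the parent gets a non-positive gain, which is the intuition behind why choosing structure by loss should fit the data faster than a loss-agnostic impurity.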