GitHub user jkbradley commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah I agree with you that the original TreeBoost does not use the loss
to choose the structure of the tree; it only uses the loss to recompute the
example labels (the pseudo-residuals) and to choose the predicted values at
leaf nodes. But as Vlad said, xgboost uses the loss to choose the tree
structure as well, which intuitively should help the GBM fit the data faster.
Vlad's design allows testing vs. TreeBoost by setting impurity and loss
separately, as well as testing vs. xgboost by setting the impurity to
"loss-based" (see the sketch after the bullet below).
* One question is whether we should change the default impurity to
"loss-based," which would change the default behavior to be closer to
xgboost's.
@vlad17 Test gists: I had a few questions about the gists you referenced
in the PR description for comparing MLlib with R's gbm.
* ```setMinInstancesPerNode(10)```: For MLlib, you set
minInstancesPerNode=10. Is this the same value used by gbm by default? I'm
trying to match up how the tests were run.
* At one point, you have the MLlib script output the value
```counts.max / counts.sum```. I wasn't sure what that value was for. My
guess was that it is a sanity check to verify that the train/test splits are
identical across tests, but I don't see the gbm script output anything
comparable (my reading is sketched below).
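To make that guess concrete, here is a minimal sketch of the check as I read
it. The input path, split fractions, and seed are hypothetical, and I am
assuming ```counts``` holds the sizes of the two splits:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-check").getOrCreate()

// Hypothetical input and split parameters; only the check itself matters.
val data = spark.read.format("libsvm").load("sample_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

// If counts holds the split sizes, counts.max / counts.sum is the fraction
// of rows in the larger split: a one-number fingerprint that stays the same
// across runs exactly when the split (seed and input) is identical.
val counts = Seq(train.count().toDouble, test.count().toDouble)
println(counts.max / counts.sum)
```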