Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/14547
  
    @sethah I agree with you that the original TreeBoost does not use the loss 
to choose the structure of the tree; it only uses the loss to recompute example 
labels and to choose predicted values at leaf nodes.  But as Vlad said, xgboost 
uses the loss to choose tree structure, which intuitively should help the GBM 
to fit the data faster.  Vlad's design allows testing vs. TreeBoost by setting 
impurity and loss separately, as well as testing vs. xgboost by setting 
impurity to be "loss-based."
    * One question is whether we should change the default impurity to be 
"loss-based," which would bring the default behavior closer to xgboost's.
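
    To make the distinction concrete, here is a sketch (Python for brevity; 
not Spark's or xgboost's actual implementation, and the toy data, squared-error 
loss, and function names are my own) contrasting the classic variance-impurity 
split gain with an xgboost-style loss-based gain built from gradients and 
Hessians of the boosting loss:

```python
# Illustrative sketch only: compare a variance-impurity split gain with an
# xgboost-style loss-based gain on a toy regression split.

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def impurity_gain(parent, left, right):
    # Classic tree criterion: weighted decrease in label variance.
    n = len(parent)
    return (variance(parent)
            - (len(left) / n) * variance(left)
            - (len(right) / n) * variance(right))

def loss_based_gain(grads, hess, split):
    # xgboost-style gain: 0.5 * (G_L^2/H_L + G_R^2/H_R - G^2/H), where G and H
    # are sums of first/second derivatives of the loss at current predictions.
    def score(g, h):
        return sum(g) ** 2 / sum(h)
    gl, gr = grads[:split], grads[split:]
    hl, hr = hess[:split], hess[split:]
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(grads, hess))

labels = [1.0, 1.2, 0.9, 3.1, 2.8, 3.0]
preds = [2.0] * 6                                 # current ensemble predictions
grads = [p - y for p, y in zip(preds, labels)]    # d/dp of 0.5*(p - y)^2
hess = [1.0] * 6                                  # second derivative is 1

print(impurity_gain(labels, labels[:3], labels[3:]))
print(loss_based_gain(grads, hess, 3))
```

    For squared-error loss with unit Hessians, the loss-based gain works out 
to exactly 0.5 * n times the variance gain, so the two criteria pick the same 
splits; they diverge for losses like absolute error or logloss, which is where 
a "loss-based" impurity actually changes tree structure.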
    
    @vlad17  Test gists: I had a few questions about the gists you referenced 
in the PR description for comparing MLlib with R's gbm.
    * ```setMinInstancesPerNode(10)```: For MLlib, you set 
minInstancesPerNode=10.  Is this the same value used by gbm by default?  I'm 
trying to match up how the tests were run.
    * At one point, you have the MLlib script output the value ```counts.max / 
counts.sum```.  I wasn't sure what the value was for.  My guess was that it was 
a sanity check to verify that the train/test splits are identical across tests, 
but I don't see it output by the gbm script.


