Github user sethah commented on the issue:
So, taking a look at the current patch, the API for this "loss-based"
impurity feels clunky and a bit confusing. To enumerate, we have the following
**1. Completely decoupled loss and impurity** - Spark 2.0 and before
**2. Terminal node updates** - trees trained with _some_ impurity, but
predictions are optimal (this patch), aka "TreeBoost"
**3. Train trees with loss directly**- this incorporates both terminal
node refinements and minimizing the loss directly during tree training. This
could be done in the future and is how XGBoost works.
I don't exactly see why we need to maintain 1 at all. Terminal node
refinements should be a strict improvement over 1. By attempting to maintain
both options, we expose an "impurity" parameter for GBTs, which IMO does not
need to be public. To me, it is confusing also. e.g.
res1: String = loss-based
What exactly does that mean to the user? I think that, ideally in the
future, we would support 2 and 3 and users could select between the two. I can
envision there being trade-offs between 2 and 3, but in theory 2 should be a
strict superset of 1.
AFAIK, scikit does not expose the option to perform terminal-node updates
to the user and I am not even sure there is a paper documenting gradient tree
boosting without explicitly performing the terminal node updates. For instance,
we have previously referred to this paper
https://statweb.stanford.edu/~jhf/ftp/stobst.pdf (please correct me if I'm
wrong here) in the code as justification for describing the current approach as
"Stochastic Gradient Boosting." But in the paper both sections 1 and 2 use
terminal node refinements.
Silently changing the algorithm could be tricky, but if we can be certain
that this change is a strict improvement then I'm not sure it's a problem. I'm
curious to hear others' thoughts on this.
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org