Github user cthom commented on the pull request:
https://github.com/apache/spark/pull/3951#issuecomment-71731512
Is there any way to maintain some state about the model as it is being
built? For GBT models, one usually sees a plot of error versus the number of
trees in the model. If the model setup incorporates a hold-out or
test/validation data set, we can determine after the fact the optimal number
of trees in the model (any more and it starts to overfit).
At the moment, my solution has been to extract the trees from the model,
iteratively recreate a sub-model, and score the test data against each
sub-model. But this is fairly expensive. I figure there must be an internal
assessment of model performance at each step of the building phase; if this
were retained, I think there would be a lot of value. I'm a little unsure how
to implement it, though.
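A cheaper alternative to rebuilding sub-models is to score each tree against the test data exactly once and keep a running weighted sum per point: after adding tree k's contribution, the running sums are exactly the predictions of the first-k-trees sub-model, so error-vs-number-of-trees costs a single pass over the ensemble. Below is a minimal, self-contained Python sketch of that idea; the function name `staged_errors` and the toy "trees" (plain threshold functions) are hypothetical illustrations, assuming the boosted model exposes its per-tree predictors and weights (as MLlib's GradientBoostedTreesModel does via `trees` and `treeWeights`).

```python
def staged_errors(tree_predicts, tree_weights, test_points, test_labels):
    """Mean squared error of the first k trees, for k = 1..n_trees.

    Each tree is applied to each test point exactly once; running
    weighted sums give every sub-model's prediction, so there is no
    need to rebuild and re-score sub-models from scratch.
    """
    n = len(test_points)
    partial = [0.0] * n  # running weighted sum of tree outputs per point
    errors = []
    for predict, w in zip(tree_predicts, tree_weights):
        for i, x in enumerate(test_points):
            partial[i] += w * predict(x)
        mse = sum((p - y) ** 2 for p, y in zip(partial, test_labels)) / n
        errors.append(mse)
    return errors

# Toy ensemble: three "trees" modeled as simple threshold functions
# (stand-ins for DecisionTreeModel.predict).
trees = [
    lambda x: 1.0 if x > 0.5 else 0.0,
    lambda x: 0.5 if x > 0.3 else -0.5,
    lambda x: 0.1,
]
weights = [1.0, 0.5, 0.5]
xs = [0.1, 0.4, 0.6, 0.9]  # hold-out features
ys = [0.0, 0.0, 1.0, 1.0]  # hold-out labels

errs = staged_errors(trees, weights, xs, ys)
# Pick the sub-model size that minimizes held-out error.
best_k = min(range(len(errs)), key=lambda k: errs[k]) + 1
```

(For what it's worth, later MLlib versions expose a per-iteration evaluation method on the GBT model along these lines; whether it is available depends on the Spark version.)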