Github user cthom commented on the pull request:
https://github.com/apache/spark/pull/3951#issuecomment-71731512
Is there any way to maintain some state about the model as it is being
built? For GBT models, one usually sees a plot of error versus the number of
trees in the model. If the model setup incorporates a hold-out or
test/validation data set, we can determine after the fact the optimal number
of trees in the model (any more and it starts to overfit).
At the moment, my solution has been to extract the trees from the model,
iteratively recreate a sub-model, and score the test data against each
sub-model. But this is fairly expensive. I figure there must be an internal
assessment of model performance at each step of the building phase; if this
were retained, I think there would be a lot of value. I'm a little unsure how
to implement it, though.
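A cheaper alternative to rebuilding sub-models is to score each tree against the test data exactly once and keep a running weighted sum per point: after adding tree k's contribution, the running sums are exactly the predictions of the first-k-trees sub-model, so error-vs-number-of-trees costs a single pass over the ensemble. Below is a minimal, self-contained Python sketch of that idea; the function name `staged_errors` and the toy "trees" (plain threshold functions) are hypothetical illustrations, assuming the boosted model exposes its per-tree predictors and weights (as MLlib's GradientBoostedTreesModel does via `trees` and `treeWeights`).

```python
def staged_errors(tree_predicts, tree_weights, test_points, test_labels):
    """Mean squared error of the first k trees, for k = 1..n_trees.

    Each tree is applied to each test point exactly once; running
    weighted sums give every sub-model's prediction, so there is no
    need to rebuild and re-score sub-models from scratch.
    """
    n = len(test_points)
    partial = [0.0] * n  # running weighted sum of tree outputs per point
    errors = []
    for predict, w in zip(tree_predicts, tree_weights):
        for i, x in enumerate(test_points):
            partial[i] += w * predict(x)
        mse = sum((p - y) ** 2 for p, y in zip(partial, test_labels)) / n
        errors.append(mse)
    return errors

# Toy ensemble: three "trees" modeled as simple threshold functions
# (stand-ins for DecisionTreeModel.predict).
trees = [
    lambda x: 1.0 if x > 0.5 else 0.0,
    lambda x: 0.5 if x > 0.3 else -0.5,
    lambda x: 0.1,
]
weights = [1.0, 0.5, 0.5]
xs = [0.1, 0.4, 0.6, 0.9]  # hold-out features
ys = [0.0, 0.0, 1.0, 1.0]  # hold-out labels

errs = staged_errors(trees, weights, xs, ys)
# Pick the sub-model size that minimizes held-out error.
best_k = min(range(len(errs)), key=lambda k: errs[k]) + 1
```

(For what it's worth, later MLlib versions expose a per-iteration evaluation method on the GBT model along these lines; whether it is available depends on the Spark version.)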