I would like to make a related suggestion, but one that focuses on choosing a 
lower bound for the number of trees rather than an upper bound. From a 
theoretical perspective, it doesn't make sense to me how fewer trees could 
result in a random forest model with better generalization performance. If you 
observe better performance on the same independent test set with fewer trees, 
I would say that this is likely not a good indicator of better generalization 
performance. It could be due to overfitting, the particular train/test 
resampling, and/or picking up artifacts in the dataset. 

As a general suggestion, I would choose a reasonable number of trees that is 
computationally feasible given the size of the dataset and the number of 
hyperparameter settings to compare via model selection. Then, after tuning, I 
would refit the best hyperparameter setting with 10x more trees and see whether 
you notice any significant difference in the cross-validation performance. 
Next, I would fit the model to the whole training set with those best 
hyperparameters and evaluate its performance on the independent test set.
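
A minimal sketch of that workflow, in case it helps. The synthetic dataset, 
the base forest size of 20 trees, the small parameter grid, and the choice to 
keep the larger forest for the final fit are all assumptions made for 
illustration, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic data stands in for the real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

base_trees = 20  # hypothetical "computationally feasible" forest size

# Step 1: tune the other hyperparameters with the forest size held fixed.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=base_trees, random_state=0),
    param_grid={"max_features": ["sqrt", 0.5], "min_samples_leaf": [1, 5]},
    cv=5,
)
search.fit(X_train, y_train)
best_params = search.best_params_

# Step 2: keep the best setting, use 10x more trees, and compare CV scores.
big_forest = RandomForestClassifier(
    n_estimators=10 * base_trees, random_state=0, **best_params)
cv_big = cross_val_score(big_forest, X_train, y_train, cv=5)
print("CV accuracy with %d trees: %.3f" % (base_trees, search.best_score_))
print("CV accuracy with %d trees: %.3f +/- %.3f"
      % (10 * base_trees, cv_big.mean(), cv_big.std()))

# Step 3: fit on the whole training set and evaluate once on the test set.
# Assumption: the larger forest is kept, since more trees should not hurt.
big_forest.fit(X_train, y_train)
test_acc = big_forest.score(X_test, y_test)
print("Test accuracy: %.3f" % test_acc)
```

If the two cross-validation scores differ by less than the fold-to-fold 
spread, the smaller forest was probably already large enough for tuning.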

Best,
Sebastian


> On Dec 24, 2018, at 9:27 PM, Brown J.B. via scikit-learn 
> <scikit-learn@python.org> wrote:
> 
> Take random forest as an example: say I pass the number of estimators from 
> 10 to 10000 (10, 100, 1000, 10000) into grid search.
> Based on the result, I found that estimator=100 is the best, but I don't 
> know whether a value lower or greater than 100 would be better.
> How should I decide? Brute force, or is there a better tool than GridSearchCV?
> 
> A simple but nonetheless practical solution is to 
>   (1) start with an upper bound on the number of trees you are willing to 
> accept in the model, 
>   (2) obtain its performance (ACC, MCC, F1, etc) as the starting reference 
> point,
>   (3) systematically lower the number of trees (log2 scale down, fixed size 
> decrement, etc)
>   (4) obtain the reduced forest size performance,
>   (5) Repeat (3)-(4) until [performance(reference) - performance(current 
> forest size)] > tolerance
> 
> You can encapsulate that in a function which then returns the final model you 
> obtain.
> From the model object, the number of trees can be obtained.
> 
> J.B.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
