Re: [Scikit-learn-general] Random forest low score on testing data

Luca Puggini Fri, 05 Feb 2016 12:47:34 -0800

The number of trees (n_estimators) should be as large as you can afford.
Adding trees does not cause overfitting.  In a random forest, overfitting is
usually caused by tree depth and by variables with many unique values.  I
suggest you start with extremely randomized trees with low depth.  If you
want to stay with RF, you can try reducing the number of variables
considered at each split (max_features).
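
A minimal sketch of the two options above, assuming a recent scikit-learn
where GridSearchCV lives in sklearn.model_selection.  The synthetic data from
make_regression and the small n_estimators are placeholders to keep it quick,
not the setup from this thread; the candidate values are the ones already
mentioned further down in the quoted messages.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Placeholder data (assumption); substitute your own X, y.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Option 1: extremely randomized trees, defaults everywhere except a
# cross-validated search over small max_depth values.
et_search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 3, 5, 7, 10]},
    cv=5,
)
et_search.fit(X, y)
print("extra trees:", et_search.best_params_, round(et_search.best_score_, 3))

# Option 2: stay with random forest but limit the number of variables
# considered at each split.
rf_search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_features": [1, 3, 5]},
    cv=5,
)
rf_search.fit(X, y)
print("random forest:", rf_search.best_params_, round(rf_search.best_score_, 3))
```

best_score_ here is the mean cross-validated R^2, so it is the number to
compare against the score on held-out test data.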


Note that if you use the OOB error to estimate the prediction error, the
estimate may be biased when the number of trees is large.
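
One way to check this on your own data is to compare the OOB score with a
cross-validated score for the same settings.  This is only a sketch on
placeholder data (make_regression and n_estimators=200 are assumptions); for a
regressor, oob_score_ is an R^2 value, like the default cross_val_score.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data (assumption); substitute your own X, y.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# OOB estimate: needs bootstrap=True (the default) and oob_score=True.
rf = RandomForestRegressor(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=0
)
rf.fit(X, y)
oob_r2 = rf.oob_score_

# Cross-validated estimate for the same hyperparameters.
cv_r2 = cross_val_score(
    RandomForestRegressor(n_estimators=200, random_state=0), X, y, cv=5
).mean()

print("OOB R^2:", round(oob_r2, 3), " 5-fold CV R^2:", round(cv_r2, 3))
```

If the two numbers disagree noticeably, I would trust the cross-validated one.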

In addition, I suggest you shuffle the data at the beginning if you can.
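
A small sketch of two ways to shuffle, using toy arrays as a stand-in for the
real data (an assumption): sklearn.utils.shuffle permutes X and y together,
and KFold(shuffle=True) shuffles inside the cross-validation itself.  This
only makes sense when the row order carries no meaning, e.g. it is not a time
series.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils import shuffle

# Toy data (assumption): row i of X is [2*i, 2*i + 1], and y[i] = i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle rows of X and y with the same permutation.
X_shuf, y_shuf = shuffle(X, y, random_state=0)

# Alternatively, let the cross-validation splitter shuffle the indices.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, test_idx = next(cv.split(X))
print("first test fold indices:", test_idx)
```

The cv object can be passed as the cv argument of GridSearchCV or
cross_val_score instead of the plain integer 5.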

On Fri, Feb 5, 2016, 5:14 PM muhammad waseem <m.waseem.ah...@gmail.com>
wrote:

> Thanks Luca, I will give it a try. When you say extremely randomised, does
> this mean using a large value of n_estimators?
>
> Also, any idea how to solve the overfitting problem for random forest?
>
> Regards
> Waseem
>
>
> On Fri, Feb 5, 2016 at 5:00 PM, Luca Puggini <lucapug...@gmail.com> wrote:
>
>> Here are the extra trees:
>> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor
>>
>> They work similarly to random forest.  In my experience RF often tends to
>> overfit.
>> I suggest you start with the default parameters and cross-validate
>> only the max_depth parameter.  Start with small values of max_depth [2,
>> 3, 5, 7, 10] and check how the performance of the model changes.
>>
>> Good Luck.
>> Luca
>>
>> On Fri, Feb 5, 2016 at 4:28 PM muhammad waseem <m.waseem.ah...@gmail.com>
>> wrote:
>>
>>> Hi Luca,
>>> Could you please explain how I can use these randomized trees in
>>> scikit-learn? So you suggest I should be using Random forest?
>>>
>>>
>>> On Fri, Feb 5, 2016 at 4:13 PM, Luca Puggini <lucapug...@gmail.com>
>>> wrote:
>>>
>>>> To me the score is not so low. The model is slightly overfitting. Try
>>>> repeating the same process with extremely randomized trees instead of
>>>> random forest, and try to keep the depth low.
>>>> On Fri 5 Feb 2016 at 16:01 muhammad waseem <m.waseem.ah...@gmail.com>
>>>> wrote:
>>>>
>>>>> Dear All,
>>>>> I am trying to train my model using scikit-learn's Random forest
>>>>> (Regression) and have tried to use GridSearch with cross-validation (CV=5)
>>>>> to tune hyperparameters. I fixed n_estimators=2000 for all cases. Below
>>>>> are a few of the searches that I performed.
>>>>>
>>>>> 1) max_features: [1,3,5], max_depth: [1,5,10,15],
>>>>> min_samples_split: [2,6,8,10], bootstrap: [True, False]
>>>>> The best were max_features=5, max_depth=15, min_samples_split=10,
>>>>> bootstrap=True
>>>>> Best score = 0.8724
>>>>>
>>>>> Then I searched close to the parameters that were best;
>>>>> 2) max_features: [3,5,6], max_depth: [10,20,30,40],
>>>>> min_samples_split: [8,16,20,24], bootstrap: [True, False]
>>>>> The best were max_features=5, max_depth=30, min_samples_split=20,
>>>>> bootstrap=True
>>>>> Best score = 0.8722
>>>>>
>>>>> Again, I searched close to the parameters that were best;
>>>>> 3) max_features: [2,4,6], max_depth: [25,35,40,50],
>>>>> min_samples_split: [22,28,34,40], bootstrap: [True, False]
>>>>> The best were max_features=4, max_depth=25, min_samples_split=22,
>>>>> bootstrap=True
>>>>> Best score = 0.8725
>>>>>
>>>>> Then I used GridSearch among the best parameters found in the above
>>>>> runs and found the best one to be max_features=4, max_depth=15,
>>>>> min_samples_split=10,
>>>>> Best score = 0.8729
>>>>>
>>>>> Then I used these parameters to predict on an unseen dataset but got
>>>>> a very low score (around 0.72).
>>>>>
>>>>> My questions are:
>>>>> 1) Am I doing the hyperparameter tuning correctly, or am I
>>>>> missing something?
>>>>>
>>>>> 2) Why is my testing score so low compared to my training and
>>>>> validation scores, and how can I improve it so that I get good predictions
>>>>> out of my model?
>>>>>
>>>>> Sorry if these are basic questions; I am new to scikit-learn and ML.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> Site24x7 APM Insight: Get Deep Visibility into Application Performance
>>>>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>>>>> Monitor end-to-end web transactions and take corrective actions now
>>>>> Troubleshoot faster and improve end-user experience. Signup Now!
>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>> --
>>>>
>>>> Sent by mobile phone
>>>>
>>>
>
