Re: [Scikit-learn-general] Random forest low score on testing data

muhammad waseem Fri, 05 Feb 2016 13:34:47 -0800

Hi Luca,
Thanks for your time and answer. I will try this with a lower max_depth (for
both extremely randomised trees and RF) to see what happens.
By the number of variables used at each split, you mean min_samples_split, right?
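
For reference, in scikit-learn the number of variables considered at each split is set by max_features, whereas min_samples_split is the minimum number of samples a node must contain before it may be split. A minimal sketch of the difference, assuming synthetic data and illustrative parameter values (none of these come from the thread):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical; not the dataset discussed here).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# max_features: how many candidate variables are examined at each split.
rf_few_vars = RandomForestRegressor(
    n_estimators=50, max_features=2, random_state=0
).fit(X, y)

# min_samples_split: nodes holding fewer samples than this are not split.
rf_big_nodes = RandomForestRegressor(
    n_estimators=50, min_samples_split=40, random_state=0
).fit(X, y)


def mean_node_count(forest):
    """Average number of nodes per tree, a rough measure of tree size."""
    trees = forest.estimators_
    return sum(t.tree_.node_count for t in trees) / len(trees)


nodes_few_vars = mean_node_count(rf_few_vars)
nodes_big_nodes = mean_node_count(rf_big_nodes)
print("mean nodes, max_features=2:       ", nodes_few_vars)
print("mean nodes, min_samples_split=40: ", nodes_big_nodes)
```

Raising min_samples_split stops splitting earlier and so yields smaller trees, while lowering max_features leaves tree size largely alone and instead makes the individual trees less correlated.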


I did not use the OOB score.
I will also try shuffling my data.

Thanks again.
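
The two experiments mentioned above (shuffling the data, then trying low max_depth values with both random forest and extremely randomized trees) might look roughly like the following. This is only a sketch: the data is synthetic, n_estimators is kept small for speed, and it assumes a scikit-learn version that provides sklearn.model_selection (0.18 or later).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle

# Synthetic stand-in data (hypothetical; not the dataset discussed here).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Shuffle rows once up front so the CV folds do not follow the original order.
X, y = shuffle(X, y, random_state=0)

results = {}
models = [("rf", RandomForestRegressor), ("extra_trees", ExtraTreesRegressor)]
for name, Model in models:
    for depth in [2, 3, 5, 7, 10]:
        model = Model(n_estimators=50, max_depth=depth, random_state=0)
        # For regressors the default score is R^2.
        scores = cross_val_score(model, X, y, cv=5)
        results[(name, depth)] = scores.mean()

for (name, depth), score in sorted(results.items()):
    print(f"{name:12s} max_depth={depth:2d}  mean CV R^2 = {score:.3f}")
```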


On Fri, Feb 5, 2016 at 8:46 PM, Luca Puggini <[email protected]> wrote:

> The number of trees (n_estimators) should be as large as possible.
> It does not cause overfitting. In random forests, overfitting is usually
> caused by the tree depth and by variables with several unique values. I
> suggest you start with randomized trees of low depth. If you want to
> use RF, you can try reducing the number of variables used at each split.
>
> Note that if you use OOB to estimate the prediction error, the estimate may
> be biased when the number of trees is large.
>
> In addition, I suggest you shuffle the data at the beginning if you
> can.
>
> On Fri, Feb 5, 2016, 5:14 PM muhammad waseem <[email protected]>
> wrote:
>
>> Thanks Luca, I will give it a try. When you say extremely randomised,
>> does this mean using a large number of n_estimators?
>>
>> Also, any idea how to solve the overfitting problem for random forest?
>>
>> Regards
>> Waseem
>>
>>
>> On Fri, Feb 5, 2016 at 5:00 PM, Luca Puggini <[email protected]>
>> wrote:
>>
>>> Here are the extra trees:
>>> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor
>>>
>>> They work similarly to random forest. In my experience, RF often tends to
>>> overfit.
>>> I suggest you start with the default parameters and cross-validate
>>> only on the max_depth parameter. Start with small values of max_depth [2,
>>> 3, 5, 7, 10] and check how the performance of the model changes.
>>>
>>> Good Luck.
>>> Luca
>>>
>>> On Fri, Feb 5, 2016 at 4:28 PM muhammad waseem <[email protected]>
>>> wrote:
>>>
>>>> Hi Luca,
>>>> Could you please explain how I can use these randomized trees in
>>>> scikit-learn? So you suggest I should be using Random forest?
>>>>
>>>>
>>>> On Fri, Feb 5, 2016 at 4:13 PM, Luca Puggini <[email protected]>
>>>> wrote:
>>>>
>>>>> To me the score is not so low. The model is slightly overfitting. Try
>>>>> repeating the same process with extremely randomized trees instead of
>>>>> random forest, and try to keep the depth low.
>>>>> On Fri 5 Feb 2016 at 16:01 muhammad waseem <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Dear All,
>>>>>> I am trying to train my model using scikit-learn's Random forest
>>>>>> (Regression) and have tried to use GridSearch with cross-validation
>>>>>> (CV=5) to tune hyperparameters. I fixed n_estimators=2000 for all cases.
>>>>>> Below are the searches that I performed.
>>>>>>
>>>>>> 1) max_features: [1,3,5], max_depth: [1,5,10,15],
>>>>>> min_samples_split: [2,6,8,10], bootstrap: [True, False]
>>>>>> The best were max_features=5, max_depth=15, min_samples_split=10,
>>>>>> bootstrap=True
>>>>>> Best score = 0.8724
>>>>>>
>>>>>> Then I searched close to the parameters that were best:
>>>>>> 2) max_features: [3,5,6], max_depth: [10,20,30,40],
>>>>>> min_samples_split: [8,16,20,24], bootstrap: [True, False]
>>>>>> The best were max_features=5, max_depth=30, min_samples_split=20,
>>>>>> bootstrap=True
>>>>>> Best score = 0.8722
>>>>>>
>>>>>> Again, I searched close to the parameters that were best:
>>>>>> 3) max_features: [2,4,6], max_depth: [25,35,40,50],
>>>>>> min_samples_split: [22,28,34,40], bootstrap: [True, False]
>>>>>> The best were max_features=4, max_depth=25, min_samples_split=22,
>>>>>> bootstrap=True
>>>>>> Best score = 0.8725
>>>>>>
>>>>>> Then I used GridSearch among the best parameters found in the above
>>>>>> runs and found the best ones to be max_features=4, max_depth=15,
>>>>>> min_samples_split=10.
>>>>>> Best score = 0.8729
>>>>>>
>>>>>> Then I used these parameters to predict for an unknown dataset but
>>>>>> got a very low score (around 0.72).
>>>>>>
>>>>>> My questions are:
>>>>>> 1) Am I doing the hyperparameter tuning correctly, or am I missing
>>>>>> something?
>>>>>>
>>>>>> 2) Why is my testing score so low compared to my training and
>>>>>> validation scores, and how can I improve it so that I get good predictions
>>>>>> out of my model?
>>>>>>
>>>>>> Sorry if these are basic questions; I am new to scikit-learn and
>>>>>> ML.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Site24x7 APM Insight: Get Deep Visibility into Application Performance
>>>>>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>>>>>> Monitor end-to-end web transactions and take corrective actions now
>>>>>> Troubleshoot faster and improve end-user experience. Signup Now!
>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
>>>>>> _______________________________________________
>>>>>> Scikit-learn-general mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>
>>>>> --
>>>>>
>>>>> Sent by mobile phone
>>>>>
>>>>
>>
>>

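The workflow described in the original question (grid search with 5-fold cross-validation, then scoring on unseen data) could be sketched as below. Everything specific here is an assumption for illustration: the data is synthetic, n_estimators is cut from 2000 to 20, and the grid is a trimmed version of the first search so that the example runs quickly. It also assumes sklearn.model_selection is available (scikit-learn 0.18 or later).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical; not the dataset discussed here).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Hold out a test set that the grid search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

# Trimmed version of the first grid from the question.
param_grid = {
    "max_features": [1, 3, 5],
    "max_depth": [1, 5, 10, 15],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=5,  # 5-fold CV, as in the question
)
search.fit(X_train, y_train)

# The best model is refit on all training data; score() reports R^2.
train_score = search.score(X_train, y_train)
cv_score = search.best_score_
test_score = search.score(X_test, y_test)

print("best params:", search.best_params_)
print(f"train R^2 = {train_score:.3f}")
print(f"CV R^2    = {cv_score:.3f}")
print(f"test R^2  = {test_score:.3f}")
```

Printing the three scores side by side makes the gap between training, validation and test performance visible, which is the gap the question asks about.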