Re: [Scikit-learn-general] Random forest low score on testing data

Luca Puggini Fri, 05 Feb 2016 17:47:38 -0800

If I understood correctly, he is using a training set for model selection
and training, and a separate test set to evaluate the results. If he gets
good performance on the training set and poor performance on the test set,
it may be because the test set contains different information from the
training set. This is common, for example, in time series.
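
A minimal sketch of the time-series case described above, on synthetic data
(the data, the drift and all parameter values here are assumptions for
illustration, not the poster's setup). It compares a shuffled K-fold
estimate with a chronological split, where each test fold lies after its
training data in time:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.RandomState(0)
n = 600
t = np.arange(n)

# Hypothetical data: the feature drifts upward over time, so later rows
# cover a range of x that earlier rows never contain.
x = t / n + rng.normal(scale=0.05, size=n)
X = x.reshape(-1, 1)
y = np.sin(6 * x) + rng.normal(scale=0.1, size=n)

model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)

# Shuffled folds mix early and late rows, so every test fold resembles
# the training data.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
# Chronological folds always test on rows that come after the training rows.
ordered_cv = TimeSeriesSplit(n_splits=5)

shuffled_r2 = cross_val_score(model, X, y, cv=shuffled_cv).mean()
ordered_r2 = cross_val_score(model, X, y, cv=ordered_cv).mean()

print("shuffled CV R^2:      %.3f" % shuffled_r2)
print("chronological CV R^2: %.3f" % ordered_r2)
```

If the chronological score is much lower than the shuffled one, the gap
between validation and test scores is more likely a data-shift problem than
a tuning problem.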
On Fri 5 Feb 2016 at 21:43 Jacob Schreiber <[email protected]> wrote:


> I'm a bit unclear what you expect shuffling the data to do, Luca, since
> you end up taking a random sample if you bootstrap and re-ordering it
> anyway.
>
> Jacob
>
> On Fri, Feb 5, 2016 at 1:32 PM, muhammad waseem <[email protected]>
> wrote:
>
>> Hi Luca,
>> Thanks for your time and answer. I will try this with a lower max_depth
>> (for both randomised trees and RF) to see what happens.
>> By the number of variables used at each split, you mean
>> min_samples_split, right?
>>
>> I did not use the OOB score.
>> I will also try shuffling my data.
>>
>> Thanks again.
>>
>>
>> On Fri, Feb 5, 2016 at 8:46 PM, Luca Puggini <[email protected]>
>> wrote:
>>
>>> The number of trees (n_estimators) should be as large as possible; it
>>> does not cause overfitting. In a random forest, overfitting is usually
>>> caused by the depth of the trees and by variables with many unique
>>> values. I suggest you start with randomized trees with a low depth. If
>>> you want to use RF, you can try reducing the number of variables
>>> considered at each split.
>>>
>>> Note that if you use the OOB score to estimate the prediction error, it
>>> may be biased when the number of trees is large.
>>>
>>> In addition, I suggest you shuffle the data at the beginning if you can.
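
A hedged sketch of the suggestion above. In scikit-learn the number of
variables considered at each split is controlled by max_features (a float is
read as a fraction of the features). The synthetic dataset and the candidate
values are assumptions for illustration only. The OOB score is printed next
to a held-out score so the two estimates can be compared:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical regression data standing in for the real dataset.
X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for max_features in (1.0, 0.5, 0.3):  # fraction of features tried per split
    forest = RandomForestRegressor(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,  # needs bootstrap=True, which is the default
        random_state=0,
    )
    forest.fit(X_train, y_train)
    # (OOB R^2 on the training rows, R^2 on the held-out rows)
    results[max_features] = (forest.oob_score_, forest.score(X_test, y_test))

for max_features, (oob_r2, test_r2) in results.items():
    print("max_features=%.1f  OOB R^2=%.3f  test R^2=%.3f"
          % (max_features, oob_r2, test_r2))
```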
>>>
>>> On Fri, Feb 5, 2016, 5:14 PM muhammad waseem <[email protected]>
>>> wrote:
>>>
>>>> Thanks Luca, I will give it a try. When you say extremely randomised,
>>>> does this mean using a large value of n_estimators?
>>>>
>>>> Also, any idea how to solve the overfitting problem for random forest?
>>>>
>>>> Regards
>>>> Waseem
>>>>
>>>>
>>>> On Fri, Feb 5, 2016 at 5:00 PM, Luca Puggini <[email protected]>
>>>> wrote:
>>>>
>>>>> Here are the extra trees:
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor
>>>>>
>>>>> They work similarly to random forest. In my experience, RF often tends
>>>>> to overfit.
>>>>> I suggest you start with the default parameters and cross-validate
>>>>> only on the max_depth parameter. Start with small values of max_depth
>>>>> [2, 3, 5, 7, 10] and check how the performance of the model changes.
>>>>>
>>>>> Good Luck.
>>>>> Luca
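
The advice above could be sketched roughly as follows. The dataset is
synthetic and n_estimators=100 is an arbitrary choice for the example; only
ExtraTreesRegressor and the max_depth values come from the message. A
held-out test set is kept aside so the cross-validated score can be compared
with a score on data the search never saw:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical regression data standing in for the real dataset.
X, y = make_regression(n_samples=400, n_features=8, noise=15.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Cross-validate only on max_depth, leaving the other parameters at their
# defaults (apart from n_estimators and the seed).
search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=100, random_state=0),
    param_grid={"max_depth": [2, 3, 5, 7, 10]},
    cv=5,
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
cv_r2 = search.best_score_               # mean R^2 over the 5 folds
test_r2 = search.score(X_test, y_test)   # R^2 of the refitted best model

print("best max_depth: %d" % best_depth)
print("CV R^2: %.3f  test R^2: %.3f" % (cv_r2, test_r2))
```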
>>>>>
>>>>> On Fri, Feb 5, 2016 at 4:28 PM muhammad waseem <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi Luca,
>>>>>> Could you please explain how I can use these randomized trees in
>>>>>> scikit-learn? So you suggest I should be using Random forest?
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 4:13 PM, Luca Puggini <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> To me the score is not so low. The model is slightly overfitting.
>>>>>>> Try repeating the same process with extremely randomized trees
>>>>>>> instead of random forest, and try to keep a low depth.
>>>>>>> On Fri 5 Feb 2016 at 16:01 muhammad waseem <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Dear All,
>>>>>>>> I am trying to train my model using Scikit-learn's Random forest
>>>>>>>> (Regression) and have tried to use GridSearch with Cross-validation
>>>>>>>> (CV=5) to tune hyperparameters. I fixed n_estimators=2000 for all
>>>>>>>> cases. Below are the few searches that I performed.
>>>>>>>>
>>>>>>>> 1) max_features: [1,3,5], max_depth: [1,5,10,15],
>>>>>>>> min_samples_split: [2,6,8,10], bootstrap: [True, False]
>>>>>>>> The best were max_features=5, max_depth=15, min_samples_split=10,
>>>>>>>> bootstrap=True
>>>>>>>> Best score = 0.8724
>>>>>>>>
>>>>>>>> Then I searched close to the parameters that were best:
>>>>>>>> 2) max_features: [3,5,6], max_depth: [10,20,30,40],
>>>>>>>> min_samples_split: [8,16,20,24], bootstrap: [True, False]
>>>>>>>> The best were max_features=5, max_depth=30, min_samples_split=20,
>>>>>>>> bootstrap=True
>>>>>>>> Best score = 0.8722
>>>>>>>>
>>>>>>>> Again, I searched close to the parameters that were best:
>>>>>>>> 3) max_features: [2,4,6], max_depth: [25,35,40,50],
>>>>>>>> min_samples_split: [22,28,34,40], bootstrap: [True, False]
>>>>>>>> The best were max_features=4, max_depth=25, min_samples_split=22,
>>>>>>>> bootstrap=True
>>>>>>>> Best score = 0.8725
>>>>>>>>
>>>>>>>> Then I used GridSearch among the best parameters found in the above
>>>>>>>> runs and found the best one to be max_features=4, max_depth=15,
>>>>>>>> min_samples_split=10
>>>>>>>> Best score = 0.8729
>>>>>>>>
>>>>>>>> Then I used these parameters to predict on an unseen dataset but
>>>>>>>> got a very low score (around 0.72).
>>>>>>>>
>>>>>>>> My questions are:
>>>>>>>> 1) Am I doing the hyperparameter tuning correctly, or am I missing
>>>>>>>> something?
>>>>>>>>
>>>>>>>> 2) Why is my testing score very low compared to my training and
>>>>>>>> validation scores, and how can I improve it so that I get good
>>>>>>>> predictions out of my model?
>>>>>>>>
>>>>>>>> Sorry if these are basic questions; I am new to scikit-learn and
>>>>>>>> ML.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>> Site24x7 APM Insight: Get Deep Visibility into Application
>>>>>>>> Performance
>>>>>>>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>>>>>>>> Monitor end-to-end web transactions and take corrective actions now
>>>>>>>> Troubleshoot faster and improve end-user experience. Signup Now!
>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
>>>>>>>> _______________________________________________
>>>>>>>> Scikit-learn-general mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Sent by mobile phone