Re: [Scikit-learn-general] Random forest low score on testing data

Luca Puggini Fri, 05 Feb 2016 18:53:41 -0800

Suppose you have a medical dataset where the first 500 people are from
population A and patients 501 to 1000 are from population B.
People in pop A can be very different from those in pop B. If you
train only on the first half of the data, the model may miss important
information about pop B. If you shuffle the data at the beginning,
both the train and test sets will contain samples from pop A and pop B.
I do not know if this will help Muhammad, as it is difficult to judge without
the data. It's worth trying, as it is one line of code.
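
For illustration, here is a minimal sketch of that one line, assuming the
data are NumPy arrays and using sklearn.utils.shuffle. The synthetic data
and the names X, y and population are made up for this example and are not
taken from Muhammad's dataset:

```python
import numpy as np
from sklearn.utils import shuffle

# Hypothetical data: rows 0-499 come from population A, rows 500-999 from B.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
population = np.array(["A"] * 500 + ["B"] * 500)
y = X[:, 0] + np.where(population == "B", 5.0, 0.0)

# Splitting in the stored order: the training half only ever sees pop A.
train_pops_ordered = set(population[:500].tolist())
test_pops_ordered = set(population[500:].tolist())

# The one-line shuffle: all arrays are permuted consistently, row by row.
X_s, y_s, population_s = shuffle(X, y, population, random_state=0)

# After shuffling, both halves contain samples from pop A and pop B.
train_pops_shuffled = set(population_s[:500].tolist())
test_pops_shuffled = set(population_s[500:].tolist())

print(train_pops_ordered, test_pops_ordered)
print(train_pops_shuffled, test_pops_shuffled)
```

This only changes which rows land in the train and test halves. It will not
help if the data are truly ordered in time and the goal is to predict the
future, in which case shuffling can make the test score look better than it
should.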


I hope this clarifies things.



On Sat, Feb 6, 2016 at 2:26 AM Jacob Schreiber <[email protected]>
wrote:

> Luca, I'm not sure I understand what you're saying. All test sets have
> different information than their training sets--why does that mean
> shuffling would help? Algorithmically the tree re-sorts the data anyway
> without caring about the order it was in originally.
>
> On Fri, Feb 5, 2016 at 5:50 PM, Luca Puggini <[email protected]> wrote:
>
>> @muhammad by number of variables at each split I mean 'max_features'.
>>
>> On Sat, Feb 6, 2016 at 1:45 AM Luca Puggini <[email protected]> wrote:
>>
>>> If I understood correctly, he is using a train set for model
>>> identification and training. A test set is then used to evaluate the
>>> results. If he gets good performance on the train set and bad performance
>>> on the test set, it may be because the test set contains different
>>> information from the train set. This is common, for example, in time
>>> series.
>>> On Fri 5 Feb 2016 at 21:43 Jacob Schreiber <[email protected]>
>>> wrote:
>>>
>>>> I'm a bit unclear what you expect shuffling the data to do, Luca, since
>>>> you end up taking a random sample if you bootstrap and re-ordering it
>>>> anyway.
>>>>
>>>> Jacob
>>>>
>>>> On Fri, Feb 5, 2016 at 1:32 PM, muhammad waseem <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Luca,
>>>>> Thanks for your time and answer. I will try this with lower max_depth
>>>>> (both for randomised trees and RF, to see what happens).
>>>>> By number of variables used at each split, you mean min_samples_split,
>>>>> right?
>>>>>
>>>>> I did not use OOB score.
>>>>> I will also try shuffling my data.
>>>>>
>>>>> Thanks again.
>>>>>
>>>>>
>>>>> On Fri, Feb 5, 2016 at 8:46 PM, Luca Puggini <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> The number of trees (n_estimators) should be as large as
>>>>>> possible. It does not cause overfitting. In random forests, overfitting
>>>>>> is usually caused by the depth and by variables with many unique
>>>>>> values. I suggest you start with randomized trees with low depth.
>>>>>> If you want to use RF, you can try reducing the number of variables
>>>>>> used at each split.
>>>>>>
>>>>>> Note that if you use OOB to estimate the prediction error, it may
>>>>>> be biased when the number of trees is large.
>>>>>>
>>>>>> In addition, I suggest you shuffle the data at the beginning if
>>>>>> you can.
>>>>>>
>>>>>> On Fri, Feb 5, 2016, 5:14 PM muhammad waseem <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks Luca, I will give it a try. When you say extremely
>>>>>>> randomised, does this mean using a large value of n_estimators?
>>>>>>>
>>>>>>> Also, any idea how to solve the overfitting problem for random forest?
>>>>>>>
>>>>>>> Regards
>>>>>>> Waseem
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016 at 5:00 PM, Luca Puggini <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here are the extra trees:
>>>>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor
>>>>>>>>
>>>>>>>> They work similarly to random forest. In my experience RF often
>>>>>>>> tends to overfit.
>>>>>>>> I suggest you start with the default parameters and cross-validate
>>>>>>>> only the max_depth parameter. Start with small values of
>>>>>>>> max_depth [2, 3, 5, 7, 10] and check how the performance of the model
>>>>>>>> changes.
>>>>>>>>
>>>>>>>> Good Luck.
>>>>>>>> Luca
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2016 at 4:28 PM muhammad waseem <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Luca,
>>>>>>>>> Could you please explain how I can use these randomized trees in
>>>>>>>>> scikit-learn? So you suggest I should be using Random forest?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2016 at 4:13 PM, Luca Puggini <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> To me the score is not so low. The model is slightly
>>>>>>>>>> overfitting. Try repeating the same process with extremely randomized
>>>>>>>>>> trees instead of random forest, and try to keep the depth low.
>>>>>>>>>> On Fri 5 Feb 2016 at 16:01 muhammad waseem <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dear All,
>>>>>>>>>>> I am trying to train my model using scikit-learn's Random forest
>>>>>>>>>>> (Regression) and have tried to use GridSearch with cross-validation
>>>>>>>>>>> (CV=5) to tune hyperparameters. I fixed n_estimators=2000 for all
>>>>>>>>>>> cases. Below are the few searches that I performed.
>>>>>>>>>>>
>>>>>>>>>>> 1) max_features: [1,3,5], max_depth: [1,5,10,15],
>>>>>>>>>>> min_samples_split: [2,6,8,10], bootstrap: [True, False]
>>>>>>>>>>> The best were max_features=5, max_depth=15,
>>>>>>>>>>> min_samples_split=10, bootstrap=True
>>>>>>>>>>> Best score = 0.8724
>>>>>>>>>>>
>>>>>>>>>>> Then I searched close to the parameters that were best:
>>>>>>>>>>> 2) max_features: [3,5,6], max_depth: [10,20,30,40],
>>>>>>>>>>> min_samples_split: [8,16,20,24], bootstrap: [True, False]
>>>>>>>>>>> The best were max_features=5, max_depth=30,
>>>>>>>>>>> min_samples_split=20, bootstrap=True
>>>>>>>>>>> Best score = 0.8722
>>>>>>>>>>>
>>>>>>>>>>> Again, I searched close to the parameters that were best:
>>>>>>>>>>> 3) max_features: [2,4,6], max_depth: [25,35,40,50],
>>>>>>>>>>> min_samples_split: [22,28,34,40], bootstrap: [True, False]
>>>>>>>>>>> The best were max_features=4, max_depth=25,
>>>>>>>>>>> min_samples_split=22, bootstrap=True
>>>>>>>>>>> Best score = 0.8725
>>>>>>>>>>>
>>>>>>>>>>> Then I used GridSearch among the best parameters found in the
>>>>>>>>>>> above runs and found the best one to be max_features=4, max_depth=15,
>>>>>>>>>>> min_samples_split=10.
>>>>>>>>>>> Best score = 0.8729
>>>>>>>>>>>
>>>>>>>>>>> Then I used these parameters to predict on an unseen dataset
>>>>>>>>>>> but got a very low score (around 0.72).
>>>>>>>>>>>
>>>>>>>>>>> My questions are:
>>>>>>>>>>> 1) Am I doing the hyperparameter tuning correctly,
>>>>>>>>>>> or am I missing something?
>>>>>>>>>>>
>>>>>>>>>>> 2) Why is my testing score so low compared to my training
>>>>>>>>>>> and validation scores, and how can I improve it so that I get good
>>>>>>>>>>> predictions out of my model?
>>>>>>>>>>>
>>>>>>>>>>> Sorry if these are basic questions, as I am new to scikit-learn
>>>>>>>>>>> and ML.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>> Site24x7 APM Insight: Get Deep Visibility into Application
>>>>>>>>>>> Performance
>>>>>>>>>>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>>>>>>>>>>> Monitor end-to-end web transactions and take corrective actions
>>>>>>>>>>> now
>>>>>>>>>>> Troubleshoot faster and improve end-user experience. Signup Now!
>>>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Scikit-learn-general mailing list
>>>>>>>>>>> [email protected]
>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Sent by mobile phone
>>>>>>>>>>
>>>>>>>>>>
