Re: [Scikit-learn-general] Random forest low score on testing data

muhammad waseem Mon, 08 Feb 2016 07:24:55 -0800

Hi Luca,
Thanks for your help. I have tried shuffling my data (which made sense in
my case, as it was ordered by day, month and hour). I have also tried
lowering max_depth and using fewer features, but that did not work for me.
I have also tried ExtraTreesRegressor, but without any luck.
Using feature_importances_, I found that one of the features was not very
important, so I removed it, but that did not help with either random forest
or extra trees. Any ideas what else I could try?


Thanks
Regards
Waseem
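A minimal sketch of the checks described in this message, assuming synthetic data in place of the real dataset (the feature count, coefficients and model settings are illustrative assumptions, not values from this thread). It fits both ensembles on a shuffled split, reports train and test R^2 side by side, and reads feature_importances_ from each fitted model.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 6 features,
# of which only the first 3 influence the target.
rng = np.random.RandomState(0)
X = rng.uniform(size=(600, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=600)

# shuffle=True is the default; it is spelled out here for clarity.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

scores = {}
importances = {}
for name, Estimator in [("random forest", RandomForestRegressor),
                        ("extra trees", ExtraTreesRegressor)]:
    model = Estimator(n_estimators=100, max_depth=8, random_state=0)
    model.fit(X_train, y_train)
    # A large gap between train and test R^2 points to overfitting.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    importances[name] = model.feature_importances_
    print("%s: train R^2 = %.3f, test R^2 = %.3f" % ((name,) + scores[name]))
    print("  importances:", np.round(importances[name], 3))
```

If both models show a similar train/test gap after shuffling, the gap is less likely to come from the choice of ensemble and more likely to come from the data itself.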

On Sat, Feb 6, 2016 at 2:51 AM, Luca Puggini <lucapug...@gmail.com> wrote:

> Suppose you have a medical dataset where the first 500 people are from
> population A and the patients from 500 to 1000 are from population B.
> People in pop A can be very different from the ones in pop B. If you
> train only on the first half of the data, the model may miss important
> information about pop B. If you shuffle the data at the beginning,
> both the train and test sets will contain samples from pop A and pop B.
> I do not know if this can help Muhammad, as it is difficult to judge
> without the data. It's worth trying, as it is one line of code.
>
> I hope this clarifies things.
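The population A / population B argument can be sketched as follows. The data are synthetic and the target function is an arbitrary assumption; the only point is to compare an ordered split (train on A, test on B) with a shuffled split on the same rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Hypothetical data: rows 0-499 are "population A", rows 500-999 are
# "population B", and the two populations cover different feature ranges.
X_a = rng.uniform(0.0, 1.0, size=(500, 3))
X_b = rng.uniform(1.0, 2.0, size=(500, 3))
X = np.vstack([X_a, X_b])
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=1000)


def held_out_r2(X_train, X_test, y_train, y_test):
    """Fit a forest on the train part and return R^2 on the test part."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Ordered split: the model sees only population A and is tested on B.
r2_ordered = held_out_r2(*train_test_split(X, y, test_size=0.5, shuffle=False))

# Shuffled split: both parts contain rows from A and from B.
r2_shuffled = held_out_r2(
    *train_test_split(X, y, test_size=0.5, shuffle=True, random_state=0)
)

print("ordered split  R^2: %.3f" % r2_ordered)
print("shuffled split R^2: %.3f" % r2_shuffled)
```

Shuffling only helps when the final test data come from the same mix of populations. If the model will be used on future time periods, a shuffled score may look better than what the model achieves in practice.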
>
>
>
> On Sat, Feb 6, 2016 at 2:26 AM Jacob Schreiber <jmschreibe...@gmail.com>
> wrote:
>
>> Luca, I'm not sure I understand what you're saying. All test sets have
>> different information than their training sets--why does that mean
>> shuffling would help? Algorithmically the tree re-sorts the data anyway
>> without caring about the order it was in originally.
>>
>> On Fri, Feb 5, 2016 at 5:50 PM, Luca Puggini <lucapug...@gmail.com>
>> wrote:
>>
>>> @muhammad by number of variables at each split I mean 'max_features'.
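To make the distinction concrete: in scikit-learn, max_features is the number of candidate features considered when looking for each split, while min_samples_split is the minimum number of samples a node must hold before it may be split. A small sketch on synthetic data (the values 2 and 10 are arbitrary assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=1.0, random_state=0)

model = RandomForestRegressor(
    n_estimators=50,
    max_features=2,        # features tried at each split (out of 6)
    min_samples_split=10,  # a node with fewer than 10 samples is not split
    random_state=0,
).fit(X, y)

params = model.get_params()
# Each fitted tree records the resolved number of features per split.
features_per_split = model.estimators_[0].max_features_
print(params["max_features"], params["min_samples_split"], features_per_split)
```

Lowering max_features makes the individual trees less correlated with each other; raising min_samples_split makes each tree shallower. They are separate knobs.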
>>>
>>> On Sat, Feb 6, 2016 at 1:45 AM Luca Puggini <lucapug...@gmail.com>
>>> wrote:
>>>
>>>> If I understood correctly, he is using a train set for model
>>>> identification and training. A test set is then used to evaluate the
>>>> results. If he gets good performance on the train set and bad performance
>>>> on the test set, it may be because the test set contains different
>>>> information from the train set. This is common, for example, in time
>>>> series.
>>>> On Fri 5 Feb 2016 at 21:43 Jacob Schreiber <jmschreibe...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm a bit unclear what you expect shuffling the data to do, Luca,
>>>>> since you end up taking a random sample if you bootstrap and
>>>>> re-ordering it anyway.
>>>>>
>>>>> Jacob
>>>>>
>>>>> On Fri, Feb 5, 2016 at 1:32 PM, muhammad waseem <
>>>>> m.waseem.ah...@gmail.com> wrote:
>>>>>
>>>>>> Hi Luca,
>>>>>> Thanks for your time and answer. I will try this with lower max_depth
>>>>>> (both for randomised trees and RF, to see what happens).
>>>>>> By number of variables used at each split, you mean min_samples_split,
>>>>>> right?
>>>>>>
>>>>>> I did not use OOB score.
>>>>>> I will also try shuffling my data.
>>>>>>
>>>>>> Thanks again.
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 8:46 PM, Luca Puggini <lucapug...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The number of trees (n_estimators) should be as large as possible.
>>>>>>> It does not cause overfitting. In random forest, overfitting is
>>>>>>> usually caused by the depth and by variables with several unique
>>>>>>> values. I'd suggest you start with randomized trees with low depth.
>>>>>>> If you want to use RF, you can try to reduce the number of variables
>>>>>>> used at each split.
>>>>>>>
>>>>>>> Observe that if you use OOB to estimate the prediction error it may
>>>>>>> be biased when the number of trees is large.
>>>>>>>
>>>>>>> In addition, I'd suggest you shuffle the data at the beginning if
>>>>>>> you can.
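One way to look at the effect of n_estimators and at the OOB estimate is to record both the OOB R^2 and a held-out R^2 for forests of increasing size. This is a sketch on synthetic data; the tree counts and dataset are assumptions chosen to run quickly, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(
    n_samples=500, n_features=8, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = []
for n_trees in (25, 100, 400):
    # oob_score=True requires bootstrap=True (the default for random forest).
    model = RandomForestRegressor(
        n_estimators=n_trees, oob_score=True, bootstrap=True, random_state=0
    )
    model.fit(X_train, y_train)
    # oob_score_ is an R^2 computed from samples each tree did not train on.
    results.append((n_trees, model.oob_score_, model.score(X_test, y_test)))

for n_trees, oob_r2, test_r2 in results:
    print("n_estimators=%4d  OOB R^2=%.3f  test R^2=%.3f" % (n_trees, oob_r2, test_r2))
```

Comparing the two columns shows how closely the OOB estimate tracks the held-out score on a given dataset.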
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016, 5:14 PM muhammad waseem <
>>>>>>> m.waseem.ah...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Luca, I will give it a try. When you say extremely
>>>>>>>> randomised, does this mean using a large value of n_estimators?
>>>>>>>>
>>>>>>>> Also, any idea how to solve the overfitting problem for random forest?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Waseem
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2016 at 5:00 PM, Luca Puggini <lucapug...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Here are the extra trees:
>>>>>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor
>>>>>>>>>
>>>>>>>>> They work similarly to random forest. In my experience RF often
>>>>>>>>> tends to overfit.
>>>>>>>>> I suggest you start with the default parameters and cross-validate
>>>>>>>>> only the max_depth parameter. Start with small values of
>>>>>>>>> max_depth [2, 3, 5, 7, 10] and check how the performance of the model
>>>>>>>>> changes.
>>>>>>>>>
>>>>>>>>> Good Luck.
>>>>>>>>> Luca
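The suggestion to cross-validate only max_depth could look roughly like this. The dataset is synthetic and n_estimators=100 is an assumption to keep the run short; everything else is left at the ExtraTreesRegressor defaults.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=0)

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=100, random_state=0),
    param_grid={"max_depth": [2, 3, 5, 7, 10]},
    cv=5,
    return_train_score=True,  # keep train scores to see the train/validation gap
)
search.fit(X, y)

depths = list(search.cv_results_["param_max_depth"])
train_scores = search.cv_results_["mean_train_score"]
val_scores = search.cv_results_["mean_test_score"]
for depth, train_r2, val_r2 in zip(depths, train_scores, val_scores):
    print("max_depth=%2d  train R^2=%.3f  validation R^2=%.3f" % (depth, train_r2, val_r2))

print("best max_depth:", search.best_params_["max_depth"])
```

Watching how the train and validation columns diverge as max_depth grows gives a direct view of where the model starts to overfit.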
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2016 at 4:28 PM muhammad waseem <
>>>>>>>>> m.waseem.ah...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Luca,
>>>>>>>>>> Could you please explain how I can use these randomized trees in
>>>>>>>>>> scikit-learn? So you suggest I should be using Random forest?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 5, 2016 at 4:13 PM, Luca Puggini <
>>>>>>>>>> lucapug...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> To me the score is not so low. The model is slightly
>>>>>>>>>>> overfitting. Try to repeat the same process with extremely
>>>>>>>>>>> randomized trees instead of random forest, and try to keep a low
>>>>>>>>>>> depth.
>>>>>>>>>>> On Fri 5 Feb 2016 at 16:01 muhammad waseem <
>>>>>>>>>>> m.waseem.ah...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Dear All,
>>>>>>>>>>>> I am trying to train my model using scikit-learn's random
>>>>>>>>>>>> forest (regression) and have tried to use GridSearch with
>>>>>>>>>>>> cross-validation (CV=5) to tune hyperparameters. I fixed
>>>>>>>>>>>> n_estimators=2000 for all cases.
>>>>>>>>>>>> Below are the few searches that I performed.
>>>>>>>>>>>>
>>>>>>>>>>>> 1) max_features: [1, 3, 5], max_depth: [1, 5, 10, 15],
>>>>>>>>>>>> min_samples_split: [2, 6, 8, 10], bootstrap: [True, False]
>>>>>>>>>>>> The best were max_features=5, max_depth=15,
>>>>>>>>>>>> min_samples_split=10, bootstrap=True
>>>>>>>>>>>> Best score = 0.8724
>>>>>>>>>>>>
>>>>>>>>>>>> Then I searched close to the parameters that were best:
>>>>>>>>>>>> 2) max_features: [3, 5, 6], max_depth: [10, 20, 30, 40],
>>>>>>>>>>>> min_samples_split: [8, 16, 20, 24], bootstrap: [True, False]
>>>>>>>>>>>> The best were max_features=5, max_depth=30,
>>>>>>>>>>>> min_samples_split=20, bootstrap=True
>>>>>>>>>>>> Best score = 0.8722
>>>>>>>>>>>>
>>>>>>>>>>>> Again, I searched close to the parameters that were best:
>>>>>>>>>>>> 3) max_features: [2, 4, 6], max_depth: [25, 35, 40, 50],
>>>>>>>>>>>> min_samples_split: [22, 28, 34, 40], bootstrap: [True, False]
>>>>>>>>>>>>
>>>>>>>>>>>> The best were max_features=4, max_depth=25,
>>>>>>>>>>>> min_samples_split=22, bootstrap=True
>>>>>>>>>>>> Best score = 0.8725
>>>>>>>>>>>>
>>>>>>>>>>>> Then I used GridSearch among the best parameters found in the
>>>>>>>>>>>> above runs and found the best one to be max_features=4, max_depth=15,
>>>>>>>>>>>> min_samples_split=10.
>>>>>>>>>>>> Best score = 0.8729
>>>>>>>>>>>>
>>>>>>>>>>>> Then I used these parameters to predict on an unseen dataset
>>>>>>>>>>>> but got a very low score (around 0.72).
>>>>>>>>>>>>
>>>>>>>>>>>> My questions are:
>>>>>>>>>>>> 1) Am I doing the hyperparameter tuning correctly, or am I
>>>>>>>>>>>> missing something?
>>>>>>>>>>>>
>>>>>>>>>>>> 2) Why is my testing score so low compared to my training
>>>>>>>>>>>> and validation scores, and how can I improve it so that I get good
>>>>>>>>>>>> predictions out of my model?
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry if these are basic questions; I am new to scikit-learn
>>>>>>>>>>>> and ML.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
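The workflow in this message (grid search with 5-fold cross-validation, then scoring on data the search never saw) can be sketched as follows. The data are synthetic, the grid is a reduced version of the first search, and n_estimators=30 is an assumption to keep the run short rather than the 2000 used in the thread.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)

# Hold out a test set before any tuning so it plays no part in the search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

param_grid = {
    "max_features": [1, 3, 5],
    "max_depth": [5, 15],
    "min_samples_split": [2, 10],
}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0), param_grid, cv=5
)
search.fit(X_train, y_train)

cv_score = search.best_score_               # mean validation R^2 of the best setting
test_score = search.score(X_test, y_test)   # R^2 of the refitted best model on held-out data

print("best params:", search.best_params_)
print("CV R^2 = %.3f, held-out test R^2 = %.3f" % (cv_score, test_score))
```

When the held-out score is far below the cross-validation score, as with 0.87 versus 0.72 here, it is worth checking whether the test data differ systematically from the data used in the search.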
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>> Site24x7 APM Insight: Get Deep Visibility into Application
>>>>>>>>>>>> Performance
>>>>>>>>>>>> APM + Mobile APM + RUM: Monitor 3 App instances at just
>>>>>>>>>>>> $35/Month
>>>>>>>>>>>> Monitor end-to-end web transactions and take corrective actions
>>>>>>>>>>>> now
>>>>>>>>>>>> Troubleshoot faster and improve end-user experience. Signup Now!
>>>>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Scikit-learn-general mailing list
>>>>>>>>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>>>>>>>>>
>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>>>>>>>
