Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

Andreas Mueller Tue, 12 Apr 2016 15:52:49 -0700

Have you tried to "score" the grid-search on the non-training set?

The cross-validation uses stratified k-fold, while your confirmation used the beginning of the dataset vs. the rest.

Your data is probably not IID.
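
A minimal sketch of how non-IID data can produce this symptom, using synthetic data rather than the dataset from the thread. The assumption here (not confirmed anywhere in the thread) is that samples come in clusters of near-duplicates that share a label; all names and sizes below are made up for illustration. When folds mix members of the same cluster, a full-depth forest can simply memorise them, while keeping whole clusters out of the training folds gives a much lower and more honest score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)

# Hypothetical setup: 200 clusters of 10 near-duplicate samples each.
n_groups, per_group, n_features = 200, 10, 20
centers = rng.normal(size=(n_groups, n_features))
# The label is assigned per cluster and is unrelated to the features.
group_labels = rng.randint(0, 2, size=n_groups)

groups = np.repeat(np.arange(n_groups), per_group)
X = centers[groups] + 0.01 * rng.normal(size=(n_groups * per_group, n_features))
y = group_labels[groups]

rf = RandomForestClassifier(n_estimators=50, max_depth=None, random_state=0)

# Folds that mix members of the same cluster between train and test.
mixed = cross_val_score(
    rf, X, y, cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
)

# Folds that keep every cluster entirely in train or entirely in test.
held_out = cross_val_score(rf, X, y, cv=GroupKFold(n_splits=3), groups=groups)

print("clusters split across folds:", mixed.mean())
print("clusters held out together: ", held_out.mean())
```

If the real data has this kind of structure, a group-aware splitter (or a split by acquisition order) is probably the more trustworthy estimate; whether it does is something only the data can tell.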



On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote:

Hi all,
TL;DR: when I run GridSearchCV with RandomForestClassifier and "many" samples (280K), it falsely shows accuracy of 1.0 for full trees (max_depth=None). This doesn't happen for fewer samples.
Longer version:
I'm trying to optimise RF hyperparameters using GridSearchCV for the first time. I have a lot of data (~3M samples, 140 features), so I subsampled it to do this. First I subsampled to 3000 samples, which finished in 5min, so I ran 70K samples to see if the result would still hold. This resulted in completely different parameter choices, so I ran 280K samples overnight, to see whether at least the choices would stabilise as n -> inf. Then when I printed the top 10 models, I got the following:
In [7]: bests = sorted(random_search.grid_scores_, reverse=True, key=lambda x: x[1])

In [8]: bests[:10]
Out[8]:
[mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]
Needless to say, perfect accuracy is suspicious, and indeed, in this case, completely spurious:
In [16]: rftop = ensemble.RandomForestClassifier(**{'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'})
In [17]: rftop.fit(X[:200000], y[:200000])

In [20]: np.mean(rftop.predict(X[200000:]) == y[200000:])
Out[20]: 0.826125
That's more in line with what's expected for this dataset, and what was found by the search with 72K samples (top model: mean: 0.82640, std: 0.00341, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 20, 'max_depth': 20, 'criterion': 'gini'}).
Anyway, here's my code. Any idea why more samples would cause this overfitting / testing on training data?
# [omitted: boilerplate to load full data in X0, y0]
import numpy as np
idx = np.random.choice(len(y0), size=280000, replace=False)
X, y = X0[idx], y0[idx]
param_dist = {'n_estimators': [20, 100, 200, 500],
              'max_depth': [3, 5, 20, None],
              'max_features': ['auto', 5, 10, 20],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}
from sklearn import grid_search as gs
from time import time
from sklearn import ensemble
rf = ensemble.RandomForestClassifier()
random_search = gs.GridSearchCV(rf, param_grid=param_dist, refit=False,
                                verbose=2, n_jobs=12)
start=time(); random_search.fit(X, y); stop=time()
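
For readers on later scikit-learn releases: the `sklearn.grid_search` module was replaced by `sklearn.model_selection`, and `grid_scores_` by `cv_results_`. The sketch below shows roughly what the same workflow could look like there, including scoring the search on data it never saw. It runs on a small synthetic dataset from `make_classification` with a reduced, hypothetical grid, so the numbers say nothing about the data discussed in this thread.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data; the real X0, y0 from the post are not available here.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Keep a holdout set that the grid search never sees.
X_search, X_holdout, y_search, y_holdout = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Reduced grid for illustration only.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
    refit=True,  # refit the best model so that search.score() works
)
search.fit(X_search, y_search)

# cv_results_ is a dict of arrays; list candidates from best to worst rank.
results = search.cv_results_
for i in np.argsort(results["rank_test_score"]):
    print(
        "mean: %.5f, std: %.5f, params: %r"
        % (results["mean_test_score"][i], results["std_test_score"][i], results["params"][i])
    )

# Accuracy of the refit best model on the untouched holdout set.
holdout_accuracy = search.score(X_holdout, y_holdout)
print("holdout accuracy:", holdout_accuracy)
```

A large gap between the best cross-validated mean and the holdout accuracy is the kind of discrepancy described above and would be worth investigating before trusting the ranking.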

Thank you!

Juan.




_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
