I don't think we can deny this is strange, certainly for real-world, IID
data!

On 13 April 2016 at 10:31, Juan Nunez-Iglesias <jni.s...@gmail.com> wrote:

> Yes but would you expect sampling 280K / 3M to be qualitatively different
> from sampling 70K / 3M?
>
> At any rate I'll attempt a more rigorous test later this week and report
> back. Thanks!
>
> Juan.
>
> On Wed, Apr 13, 2016 at 10:21 AM, Joel Nothman <joel.noth...@gmail.com>
> wrote:
>
>> It's hard to believe this is a software problem rather than a data
>> problem. If your sample accidentally contained duplicates of other
>> records in the dataset, you could certainly get 100%.
>>
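Joel's duplicate hypothesis would also answer Juan's scaling question: if a record has an exact twin somewhere in the 3M pool, the chance that the twin lands in the same subsample grows linearly with the subsample size. A back-of-envelope sketch (the one-twin-per-record model is an assumption for illustration, not something established in the thread):

```python
# If a record drawn into the subsample has exactly one twin among the
# N pooled records, the twin is also drawn with probability (n-1)/(N-1)
# when n records are sampled without replacement.
N = 3_000_000
leak = {n: (n - 1) / (N - 1) for n in (70_000, 280_000)}
for n, p in leak.items():
    print(f"n={n}: twin also sampled with probability {p:.1%}")
```

So a 280K draw exposes roughly four times as many twin pairs as a 70K draw, which could plausibly make the two runs qualitatively different.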
>> On 13 April 2016 at 10:10, Juan Nunez-Iglesias <jni.s...@gmail.com>
>> wrote:
>>
>>> Hallelujah! I'd given up on this thread. Thanks for resurrecting it,
>>> Andy! =)
>>>
>>> However, I don't think data distribution can explain the result, since
>>> GridSearchCV gives the expected result (~0.8 accuracy) with 3K and 70K
>>> random samples but changes to perfect classification for 280K samples. I
>>> don't have the data on this computer so I can't test it right now, though.
>>>
>>> Juan.
>>>
>>> On Wed, Apr 13, 2016 at 8:51 AM, Andreas Mueller <t3k...@gmail.com>
>>> wrote:
>>>
>>>> Have you tried to "score" the grid-search on the non-training set?
>>>> The cross-validation is using stratified k-fold while your confirmation
>>>> used the beginning of the dataset vs the rest.
>>>> Your data is probably not IID.
>>>>
>>>>
>>>>
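Andy's check can be sketched on synthetic data. Here duplicated rows stand in for whatever non-IID structure the real data might have, and keeping both copies of a record in the same fold removes the inflation (a hypothetical example using the current sklearn.model_selection API, which postdates this thread):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

# Synthetic stand-in for non-IID data: every record appears twice, with
# 20% label noise so that memorisation is distinguishable from learning.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
groups = np.arange(len(y))
X2 = np.vstack([X, X])
y2 = np.concatenate([y, y])
g2 = np.concatenate([groups, groups])

rf = RandomForestClassifier(n_estimators=30, bootstrap=False, random_state=0)
# Shuffled k-fold: twins straddle the train/test boundary, fully grown
# trees recall them exactly, and the score is inflated.
leaky = cross_val_score(rf, X2, y2,
                        cv=KFold(n_splits=5, shuffle=True,
                                 random_state=0)).mean()
# Grouped k-fold: both copies of a record stay in the same fold.
honest = cross_val_score(rf, X2, y2, cv=GroupKFold(n_splits=5),
                         groups=g2).mean()
print(f"shuffled KFold: {leaky:.3f}   GroupKFold: {honest:.3f}")
```

The gap between the two numbers is the diagnostic: on genuinely IID data the two fold schemes should score about the same.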
>>>> On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote:
>>>>
>>>> Hi all,
>>>>
>>>> TL;DR: when I run GridSearchCV with RandomForestClassifier and "many"
>>>> samples (280K), it falsely shows accuracy of 1.0 for full trees
>>>> (max_depth=None). This doesn't happen for fewer samples.
>>>>
>>>> Longer version:
>>>>
>>>> I'm trying to optimise RF hyperparameters using GridSearchCV for the
>>>> first time. I have a lot of data (~3M samples, 140 features), so I
>>>> subsampled it to do this. First I subsampled to 3000 samples, which
>>>> finished in 5 min, so I ran 70K samples to see if the result would still
>>>> hold.
>>>> This resulted in completely different parameter choices, so I ran 280K
>>>> samples overnight, to see whether at least the choices would stabilise as n
>>>> -> inf. Then when I printed the top 10 models, I got the following:
>>>>
>>>> In [7]: bests = sorted(random_search.grid_scores_, reverse=True, key=lambda x: x[1])
>>>>
>>>> In [8]: bests[:10]
>>>> Out[8]:
>>>> [mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
>>>>  mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]
>>>>
>>>> Needless to say, perfect accuracy is suspicious, and indeed, in this
>>>> case, completely spurious:
>>>>
>>>> In [16]: rftop = ensemble.RandomForestClassifier(**{'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'})
>>>>
>>>> In [17]: rftop.fit(X[:200000], y[:200000])
>>>>
>>>> In [20]: np.mean(rftop.predict(X[200000:]) == y[200000:])
>>>> Out[20]: 0.826125
>>>>
>>>> That's more in line with what's expected for this dataset, and with what
>>>> was found by the search with 72K samples (top model: mean: 0.82640,
>>>> std: 0.00341, params: {'n_estimators': 500, 'bootstrap': False,
>>>> 'max_features': 20, 'max_depth': 20, 'criterion': 'gini'}).
>>>>
>>>> Anyway, here's my code. Any idea why more samples would cause this
>>>> apparent overfitting / testing on training data?
>>>>
>>>> # [omitted: boilerplate to load full data in X0, y0]
>>>> import numpy as np
>>>> from time import time
>>>> from sklearn import ensemble
>>>> from sklearn import grid_search as gs
>>>>
>>>> idx = np.random.choice(len(y0), size=280000, replace=False)
>>>> X, y = X0[idx], y0[idx]
>>>> param_dist = {'n_estimators': [20, 100, 200, 500],
>>>>               'max_depth': [3, 5, 20, None],
>>>>               'max_features': ['auto', 5, 10, 20],
>>>>               'bootstrap': [True, False],
>>>>               'criterion': ['gini', 'entropy']}
>>>> rf = ensemble.RandomForestClassifier()
>>>> random_search = gs.GridSearchCV(rf, param_grid=param_dist, refit=False,
>>>>                                 verbose=2, n_jobs=12)
>>>> start = time(); random_search.fit(X, y); stop = time()
>>>>
>>>> Thank you!
>>>>
>>>> Juan.
>>>>
>>>>
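The size dependence Juan describes is exactly what duplicated records in the pool would produce: the larger the subsample, the larger the fraction of records whose twin is drawn too, so CV accuracy creeps toward 1.0 only at scale. A hypothetical reproduction on synthetic data (modern sklearn.model_selection API, sizes shrunk so it runs in seconds):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the thread's 3M-record X0, y0: every record appears twice.
Xb, yb = make_classification(n_samples=5000, n_features=20, n_informative=5,
                             flip_y=0.2, random_state=0)
X0 = np.vstack([Xb, Xb])
y0 = np.concatenate([yb, yb])          # pool of 10000 records

rng = np.random.default_rng(0)
rf = RandomForestClassifier(n_estimators=30, bootstrap=False, random_state=0)
scores = {}
for n in (500, 8000):                  # small vs large subsample
    idx = rng.choice(len(y0), size=n, replace=False)
    # Larger n -> more twin pairs land in the subsample -> more of the
    # test fold is an exact copy of a training record -> inflated score.
    scores[n] = cross_val_score(rf, X0[idx], y0[idx], cv=3).mean()
    print(f"subsample n={n}: mean CV accuracy {scores[n]:.3f}")
```

On the real data one could test this hypothesis directly, for instance by counting exact duplicate rows within the 280K subsample.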
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>>
>
>