Re: [Scikit-learn-general] Weird overfitting in GridSearchCV?

Andreas Mueller Thu, 14 Apr 2016 15:32:49 -0700

When you're shuffling you get 100% accuracy, and when you're not shuffling you don't, right?


On 04/14/2016 06:16 PM, Juan Nunez-Iglesias wrote:

No, both the 280K and the 70K were random indices. See the code at the end of the original post. The 200K at the start were merely me doing a quick check that the classifier *wasn't* perfectly accurate as claimed by the grid search.

On Thu, Apr 14, 2016 at 3:38 AM, Andreas Mueller <[email protected]<mailto:[email protected]>> wrote:


    The 280k were the start of the sequence, while the 70k were from
    a shuffled bit, right?


    On 04/12/2016 08:35 PM, Joel Nothman wrote:

    I don't think we can deny this is strange, certainly for
    real-world, IID data!

    On 13 April 2016 at 10:31, Juan Nunez-Iglesias
    <[email protected] <mailto:[email protected]>> wrote:

        Yes but would you expect sampling 280K / 3M to be
        qualitatively different from sampling 70K / 3M?

        At any rate I'll attempt a more rigorous test later this week
        and report back. Thanks!

        Juan.

        On Wed, Apr 13, 2016 at 10:21 AM, Joel Nothman
        <[email protected] <mailto:[email protected]>> wrote:

            It's hard to believe this is a software problem rather
            than a data problem. If your data was accidentally a
            duplicate of the dataset, you could certainly get 100%.
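
One way to test the duplicate-data hypothesis above is to count repeated rows in the subsample before running the search. The following is a minimal sketch and not something from the thread: `count_duplicate_rows` is a hypothetical helper, and the toy array stands in for the poster's `X`. It only catches exact duplicates; near-duplicates would need a tolerance-based comparison. Note that `np.random.choice(..., replace=False)` in the posted code draws each index at most once, so any repeated rows would have to exist in `X0` itself.

```python
import numpy as np


def count_duplicate_rows(X):
    """Return how many rows of X repeat an earlier row exactly."""
    X = np.asarray(X)
    # np.unique with axis=0 collapses identical rows into a single row.
    n_unique = np.unique(X, axis=0).shape[0]
    return X.shape[0] - n_unique


# Toy stand-in for the real feature matrix (assumption: 2-D numeric array).
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [1.0, 2.0],   # exact copy of row 0
                  [5.0, 6.0]])
n_dup = count_duplicate_rows(X_toy)
print(n_dup)
```

If the full dataset did contain repeated rows, a larger random subsample would be more likely to put a copy of a test row into the training folds, which is one possible (unconfirmed) reason the effect could appear only at 280K samples.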

            On 13 April 2016 at 10:10, Juan Nunez-Iglesias
            <[email protected] <mailto:[email protected]>> wrote:

                Hallelujah! I'd given up on this thread. Thanks for
                resurrecting it, Andy! =)

                However, I don't think data distribution can explain
                the result, since GridSearchCV gives the expected
                result (~0.8 accuracy) with 3K and 70K random samples
                but changes to perfect classification for 280K
                samples. I don't have the data on this computer so I
                can't test it right now, though.

                Juan.

                On Wed, Apr 13, 2016 at 8:51 AM, Andreas Mueller
                <[email protected] <mailto:[email protected]>> wrote:

                    Have you tried to "score" the grid-search on the
                    non-training set?
                    The cross-validation is using stratified k-fold
                    while your confirmation used the beginning of the
                    dataset vs the rest.
                    Your data is probably not IID.
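
The non-IID explanation can be illustrated with a small synthetic sketch. Everything below is an assumption made for illustration: the data is generated, not the poster's; the `groups` array is hypothetical (the thread does not say whether such group labels exist for the real data); and the imports come from `sklearn.model_selection`, which is newer than the `grid_search` module used in the original post. Each group holds ten near-identical samples sharing one random label, so the features carry no signal across groups. Folds that split a group between training and test let a full-depth forest look up the answer from a sibling sample, while folds that keep groups intact do not.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.RandomState(0)

# Synthetic, hypothetical data: 200 groups of 10 near-identical samples.
# Labels are random per group, so nothing generalises between groups.
n_groups, per_group, n_features = 200, 10, 5
centers = rng.normal(size=(n_groups, n_features))
groups = np.repeat(np.arange(n_groups), per_group)
noise = 0.001 * rng.normal(size=(n_groups * per_group, n_features))
X = centers[groups] + noise
y = rng.randint(0, 2, size=n_groups)[groups]

clf = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=0)

# Shuffled folds: siblings of each test sample end up in the training set.
shuffled_acc = cross_val_score(
    clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
).mean()

# Grouped folds: a whole group is either in training or in test, never both.
grouped_acc = cross_val_score(
    clf, X, y, groups=groups, cv=GroupKFold(n_splits=5)
).mean()

print("shuffled folds: %.3f" % shuffled_acc)
print("grouped folds:  %.3f" % grouped_acc)
```

A large gap between the two numbers on real data would point to leakage between related samples rather than to a bug in the search itself.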



                    On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote:

                    Hi all,

                    TL;DR: when I run GridSearchCV with
                    RandomForestClassifier and "many" samples
                    (280K), it falsely shows accuracy of 1.0 for
                    full trees (max_depth=None). This doesn't happen
                    for fewer samples.

                    Longer version:

                    I'm trying to optimise RF hyperparameters using
                    GridSearchCV for the first time. I have a lot of
                    data (~3M samples, 140 features), so I
                    subsampled it to do this. First I subsampled to
                    3000 samples, which finished in 5min, so I ran
                    70K samples to see if the result would still hold.
                    This resulted in completely different parameter
                    choices, so I ran 280K samples overnight, to see
                    whether at least the choices would stabilise as
                    n -> inf. Then when I printed the top 10 models,
                    I got the following:

                    In [7]: bests = sorted(random_search.grid_scores_,
                                           reverse=True, key=lambda x: x[1])

                    In [8]: bests[:10]
                    Out[8]:
                    [mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 500, 'bootstrap': True,
                      'max_features': 'auto', 'max_depth': None,
                      'criterion': 'gini'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 500, 'bootstrap': True,
                      'max_features': 5, 'max_depth': None,
                      'criterion': 'gini'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 200, 'bootstrap': True,
                      'max_features': 'auto', 'max_depth': None,
                      'criterion': 'entropy'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 200, 'bootstrap': True,
                      'max_features': 5, 'max_depth': None,
                      'criterion': 'entropy'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 200, 'bootstrap': True,
                      'max_features': 20, 'max_depth': None,
                      'criterion': 'entropy'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 20, 'bootstrap': False,
                      'max_features': 'auto', 'max_depth': None,
                      'criterion': 'gini'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 100, 'bootstrap': False,
                      'max_features': 'auto', 'max_depth': None,
                      'criterion': 'gini'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 20, 'bootstrap': False,
                      'max_features': 5, 'max_depth': None,
                      'criterion': 'gini'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 100, 'bootstrap': False,
                      'max_features': 5, 'max_depth': None,
                      'criterion': 'gini'},
                     mean: 1.00000, std: 0.00000, params:
                     {'n_estimators': 500, 'bootstrap': False,
                      'max_features': 5, 'max_depth': None,
                      'criterion': 'gini'}]

                    Needless to say, perfect accuracy is suspicious,
                    and indeed, in this case, completely spurious:

                    In [16]: rftop = ensemble.RandomForestClassifier(
                        **{'n_estimators': 20, 'bootstrap': False,
                           'max_features': 'auto', 'max_depth': None,
                           'criterion': 'gini'})

                    In [17]: rftop.fit(X[:200000], y[:200000])

                    In [20]: np.mean(rftop.predict(X[200000:]) == y[200000:])
                    Out[20]: 0.826125

                    That's more in line with what's expected for
                    this dataset, and what was found by the search
                    with 72K samples (top model: mean: 0.82640,
                    std: 0.00341, params: {'n_estimators': 500,
                    'bootstrap': False, 'max_features': 20,
                    'max_depth': 20, 'criterion': 'gini'}).

                    Anyway, here's my code, any idea why more
                    samples would cause this overfitting / testing
                    on training data?

                    # [omitted: boilerplate to load full data in X0, y0]
                    import numpy as np
                    idx = np.random.choice(len(y0), size=280000,
                                           replace=False)
                    X, y = X0[idx], y0[idx]
                    param_dist = {'n_estimators': [20, 100, 200, 500],
                                  'max_depth': [3, 5, 20, None],
                                  'max_features': ['auto', 5, 10, 20],
                                  'bootstrap': [True, False],
                                  'criterion': ['gini', 'entropy']}
                    from sklearn import grid_search as gs
                    from time import time
                    from sklearn import ensemble
                    rf = ensemble.RandomForestClassifier()
                    random_search = gs.GridSearchCV(rf, param_grid=param_dist,
                                                    refit=False,
                                                    verbose=2, n_jobs=12)
                    start=time(); random_search.fit(X, y); stop=time()

                    Thank you!

                    Juan.
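
The suggestion elsewhere in the thread to "score" the grid search on a non-training set can be sketched as follows. This is an illustration under stated assumptions, not the poster's code: the data comes from `make_classification` as a stand-in, the grid is cut down so it runs in seconds, and the imports use `sklearn.model_selection` rather than the older `grid_search` module shown above. The posted search was run with `refit=False`, which leaves no fitted best model to score; here `refit=True` is used so that `search.score` can be called on rows the search never saw.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the subsampled (X, y); the real data is not shown.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Set aside rows that the search never touches.
X_search, X_holdout, y_search, y_holdout = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Deliberately tiny grid so the sketch runs quickly.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
    refit=True,  # refit the best parameters on all of X_search
)
search.fit(X_search, y_search)

cv_acc = search.best_score_                       # mean CV accuracy of the best params
holdout_acc = search.score(X_holdout, y_holdout)  # accuracy on the untouched rows

print("best params:", search.best_params_)
print("CV accuracy:       %.3f" % cv_acc)
print("hold-out accuracy: %.3f" % holdout_acc)
```

If the cross-validated accuracy is near 1.0 while the hold-out accuracy stays near 0.8, the discrepancy lies in how the folds relate to each other, not in the forest's parameters.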


                    


                    _______________________________________________
                    Scikit-learn-general mailing list
                    [email protected]
                    <mailto:[email protected]>
                    
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



                    
