The 280k were the start of the sequence, while the 70k were from a shuffled subset, right?
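
(For concreteness, a rough sketch of the two sampling styles being contrasted; X0 and y0 are the full arrays from the original post, and whether each run actually did one or the other is exactly the question:)

import numpy as np

# contiguous block from the start of the (possibly ordered) sequence
X_contig, y_contig = X0[:280000], y0[:280000]

# random subset drawn without replacement, so ordering can't matter
idx = np.random.choice(len(y0), size=70000, replace=False)
X_rand, y_rand = X0[idx], y0[idx]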

On 04/12/2016 08:35 PM, Joel Nothman wrote:
I don't think we can deny this is strange, certainly for real-world, IID data!

On 13 April 2016 at 10:31, Juan Nunez-Iglesias <jni.s...@gmail.com> wrote:

    Yes, but would you expect sampling 280K / 3M to be qualitatively
    different from sampling 70K / 3M?

    At any rate I'll attempt a more rigorous test later this week and
    report back. Thanks!

    Juan.

    On Wed, Apr 13, 2016 at 10:21 AM, Joel Nothman <joel.noth...@gmail.com> wrote:

        It's hard to believe this is a software problem rather than a
        data problem. If your dataset accidentally contained duplicated
        samples, you could certainly get 100%.
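
        (One quick way to test for that, sketched here with the X array
        from the original post; hashing each row's bytes is just one
        possible approach:)

        import numpy as np
        # count exact duplicate feature rows; many repeats would let
        # different CV folds contain identical samples, which a deep
        # forest can simply memorise
        n_unique = len({row.tobytes() for row in X})
        print(n_unique, "unique rows out of", len(X))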

        On 13 April 2016 at 10:10, Juan Nunez-Iglesias <jni.s...@gmail.com> wrote:

            Hallelujah! I'd given up on this thread. Thanks for
            resurrecting it, Andy! =)

            However, I don't think data distribution can explain the
            result, since GridSearchCV gives the expected result (~0.8
            accuracy) with 3K and 70K random samples but changes to
            perfect classification for 280K samples. I don't have the
            data on this computer so I can't test it right now, though.

            Juan.

            On Wed, Apr 13, 2016 at 8:51 AM, Andreas Mueller <t3k...@gmail.com> wrote:

                Have you tried to "score" the grid search on the
                non-training set?
                The cross-validation uses stratified k-fold, while your
                confirmation used the beginning of the dataset vs. the
                rest.
                Your data is probably not IID.
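
                (A minimal sketch of that check, assuming the X and y
                arrays from the original post; unlike the contiguous
                X[:200000] split, train_test_split shuffles before
                splitting:)

                # in scikit-learn >= 0.18 this lives in sklearn.model_selection
                from sklearn.cross_validation import train_test_split
                from sklearn import ensemble

                # shuffled hold-out split instead of "first 200K vs the rest"
                X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                                          random_state=0)
                rf = ensemble.RandomForestClassifier(n_estimators=20,
                                                     max_depth=None)
                rf.fit(X_tr, y_tr)
                print(rf.score(X_te, y_te))  # compare with the 1.0 CV score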



                On 03/10/2016 01:08 AM, Juan Nunez-Iglesias wrote:
                Hi all,

                TL;DR: when I run GridSearchCV with
                RandomForestClassifier and "many" samples (280K), it
                falsely shows accuracy of 1.0 for full trees
                (max_depth=None). This doesn't happen for fewer samples.

                Longer version:

                I'm trying to optimise RF hyperparameters using
                GridSearchCV for the first time. I have a lot of data
                (~3M samples, 140 features), so I subsampled it to do
                this. First I subsampled to 3000 samples, which
                finished in 5 min, so I then ran 70K samples to see if
                the result would still hold. This resulted in completely
                different parameter choices, so I ran 280K samples
                overnight, to see whether at least the choices would
                stabilise as n -> inf. Then when I printed the top 10
                models, I got the following:

                In [7]: bests = sorted(random_search.grid_scores_, reverse=True, key=lambda x: x[1])

                In [8]: bests[:10]
                Out[8]:
                [mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 'auto', 'max_depth': None, 'criterion': 'entropy'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 5, 'max_depth': None, 'criterion': 'entropy'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 200, 'bootstrap': True, 'max_features': 20, 'max_depth': None, 'criterion': 'entropy'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 20, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 100, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'},
                 mean: 1.00000, std: 0.00000, params: {'n_estimators': 500, 'bootstrap': False, 'max_features': 5, 'max_depth': None, 'criterion': 'gini'}]

                Needless to say, perfect accuracy is suspicious, and
                indeed, in this case, completely spurious:

                In [16]: rftop = ensemble.RandomForestClassifier(**{'n_estimators': 20, 'bootstrap': False, 'max_features': 'auto', 'max_depth': None, 'criterion': 'gini'})

                In [17]: rftop.fit(X[:200000], y[:200000])

                In [20]: np.mean(rftop.predict(X[200000:]) == y[200000:])
                Out[20]: 0.826125

                That's more in line with what's expected for this
                dataset, and with what was found by the search with 72K
                samples (top model: mean: 0.82640, std: 0.00341,
                params: {'n_estimators': 500, 'bootstrap': False,
                'max_features': 20, 'max_depth': 20, 'criterion':
                'gini'}).

                Anyway, here's my code. Any idea why more samples
                would cause this apparent overfitting / testing on
                training data?

                # [omitted: boilerplate to load the full data into X0, y0]
                import numpy as np

                # random subsample of 280K rows, drawn without replacement
                idx = np.random.choice(len(y0), size=280000, replace=False)
                X, y = X0[idx], y0[idx]

                # hyperparameter grid to search over
                param_dist = {'n_estimators': [20, 100, 200, 500],
                              'max_depth': [3, 5, 20, None],
                              'max_features': ['auto', 5, 10, 20],
                              'bootstrap': [True, False],
                              'criterion': ['gini', 'entropy']}

                from sklearn import grid_search as gs
                from time import time
                from sklearn import ensemble

                rf = ensemble.RandomForestClassifier()
                random_search = gs.GridSearchCV(rf, param_grid=param_dist,
                                                refit=False, verbose=2,
                                                n_jobs=12)
                start = time(); random_search.fit(X, y); stop = time()
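
                (A possible follow-up check, not part of the run above,
                assuming the variables defined in the snippet: score the
                winning parameter set on the ~2.7M rows that were never
                subsampled and see whether the perfect CV score survives;
                best_params_ should still be populated with refit=False:)

                # evaluate on the rows left out of the 280K subsample
                mask = np.ones(len(y0), dtype=bool)
                mask[idx] = False
                best = ensemble.RandomForestClassifier(**random_search.best_params_)
                best.fit(X, y)
                print(np.mean(best.predict(X0[mask]) == y0[mask]))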

                Thank you!

                Juan.


                
------------------------------------------------------------------------------
                Transform Data into Opportunity.
                Accelerate data analysis in your applications with
                Intel Data Analytics Acceleration Library.
                Click to learn more.
                http://pubads.g.doubleclick.net/gampad/clk?id=278785111&iu=/4140


                _______________________________________________
                Scikit-learn-general mailing list
                Scikit-learn-general@lists.sourceforge.net
                <mailto:Scikit-learn-general@lists.sourceforge.net>
                
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


                
------------------------------------------------------------------------------
                Find and fix application performance issues faster
                with Applications Manager
                Applications Manager provides deep performance
                insights into multiple tiers of
                your business applications. It resolves application
                problems quickly and
                reduces your MTTR. Get your free trial!
                https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
                _______________________________________________
                Scikit-learn-general mailing list
                Scikit-learn-general@lists.sourceforge.net
                <mailto:Scikit-learn-general@lists.sourceforge.net>
                
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



            
------------------------------------------------------------------------------
            Find and fix application performance issues faster with
            Applications Manager
            Applications Manager provides deep performance insights
            into multiple tiers of
            your business applications. It resolves application
            problems quickly and
            reduces your MTTR. Get your free trial!
            https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
            _______________________________________________
            Scikit-learn-general mailing list
            Scikit-learn-general@lists.sourceforge.net
            <mailto:Scikit-learn-general@lists.sourceforge.net>
            https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



        
------------------------------------------------------------------------------
        Find and fix application performance issues faster with
        Applications Manager
        Applications Manager provides deep performance insights into
        multiple tiers of
        your business applications. It resolves application problems
        quickly and
        reduces your MTTR. Get your free trial!
        https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
        _______________________________________________
        Scikit-learn-general mailing list
        Scikit-learn-general@lists.sourceforge.net
        <mailto:Scikit-learn-general@lists.sourceforge.net>
        https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



    
------------------------------------------------------------------------------
    Find and fix application performance issues faster with
    Applications Manager
    Applications Manager provides deep performance insights into
    multiple tiers of
    your business applications. It resolves application problems
    quickly and
    reduces your MTTR. Get your free trial!
    https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
    _______________________________________________
    Scikit-learn-general mailing list
    Scikit-learn-general@lists.sourceforge.net
    <mailto:Scikit-learn-general@lists.sourceforge.net>
    https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to