Hi,

Using max_features="auto" (the default setting) indeed yields the results that Paolo reports.
When setting max_features=None (i.e., using all features, as in our earlier code), I got the following on my machine:

                 train-time   test-time   error-rate
RandomForest      778.1471s     1.2830s       0.0248
Extra-Trees      1325.2397s     1.3544s       0.0199

which is consistent with what is mentioned in the doc.

@pprett: Since max_features=sqrt(n_features) is now the default on classification problems, the trees are usually more randomized and hence have a higher bias. To compensate for that, more trees usually need to be built, whereas we only use 20 trees in the benchmark (which is low in my opinion). The effect of max_features is very dataset-specific though: on some problems, decreasing max_features does not impair performance as much as it does here on covertype. I am not sure whether the one-hot encoding is causing this. A quick sketch of how the two settings can be compared is appended below, after the quoted thread.

Best,
Gilles

On 27 March 2012 13:38, Peter Prettenhofer <[email protected]> wrote:
> Interesting - covtype involves a number of categorical attributes
> which are represented via a one-hot encoding - do you think that such
> a representation has a significant effect on feature sampling, and thus
> on the performance of random forests?
>
> 2012/3/27 Gilles Louppe <[email protected]>:
>> Hi,
>>
>> I am running the tests again, but indeed I think the difference in the
>> results comes from the fact that max_features=sqrt(n_features) is now
>> the default, whereas it was max_features=n_features before.
>>
>> Gilles
>>
>> On 27 March 2012 11:56, Paolo Losi <[email protected]> wrote:
>>> Thanks Peter,
>>>
>>> On Tue, Mar 27, 2012 at 11:34 AM, Peter Prettenhofer
>>> <[email protected]> wrote:
>>>>
>>>> Paolo,
>>>>
>>>> I noticed that too - maybe @glouppe can comment on this - I think the
>>>> reason was a change in the ``n_features`` heuristic, but I might be
>>>> mistaken.
>>>
>>> Gilles, can you give it a quick look? If it's not anything obvious,
>>> just ping back and I'll try to git bisect the issue...
>>>
>>>> Concerning the GaussianNB - there's a PR [1] addressing a critical bug
>>>> in the estimator - it should be merged ASAP.
>>>
>>> Thanks. I've commented on the PR (the performance regression does not
>>> seem to be connected with the PR).
>>>
>>>> Furthermore, test time is quite low - this might be due to memory
>>>> layout issues - SGDClassifier converts ``coef_`` to Fortran-style
>>>> for increased test-time performance.
>>>
>>> Clear.
>>>
>>> Thanks again,
>>>
>>> Paolo
>
> --
> Peter Prettenhofer
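P.S. For what it's worth, here is a minimal sketch of the comparison I have in mind. It is not the benchmark script itself: ``fetch_covtype``, the 80/20 split, ``n_jobs=-1`` and the fixed ``random_state`` are my own assumptions, written against a recent scikit-learn API; only the 20-tree setting mirrors the benchmark discussed above.

    # Sketch: compare max_features="sqrt" (new default) vs. None (old default)
    # for RandomForest and Extra-Trees on covertype. Assumptions: a recent
    # scikit-learn with fetch_covtype available, and a simple 80/20 split
    # instead of the benchmark's own split. The full dataset has ~581k
    # samples, so this takes a while with max_features=None.
    from time import time

    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.metrics import zero_one_loss
    from sklearn.model_selection import train_test_split

    data = fetch_covtype()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.2, random_state=0)

    for max_features in ("sqrt", None):
        for Clf in (RandomForestClassifier, ExtraTreesClassifier):
            clf = Clf(n_estimators=20, max_features=max_features,
                      n_jobs=-1, random_state=0)

            t0 = time()
            clf.fit(X_train, y_train)          # train time
            train_time = time() - t0

            t0 = time()
            y_pred = clf.predict(X_test)       # test time
            test_time = time() - t0

            error = zero_one_loss(y_test, y_pred)  # error rate
            print("%s(max_features=%r): train %.1fs, test %.1fs, error %.4f"
                  % (Clf.__name__, max_features, train_time, test_time, error))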
