Hi,

Using max_features="auto" (default setting) indeed yields the results
that Paolo reports.

When setting max_features=None (i.e., using all features as in our
earlier code), I got the following on my machine:

Classifier   train-time  test-time   error-rate
RandomForest 778.1471s   1.2830s     0.0248
Extra-Trees  1325.2397s  1.3544s     0.0199

which is consistent with what is mentioned in the doc.

@pprett: Since max_features=sqrt(n_features) is now the default on
classification problems, the individual trees are usually more
randomized and hence have a higher bias. To compensate for that, more
trees usually need to be built, whereas we only use 20 trees in the
benchmark (which is low in my opinion). The effect of max_features is
very dataset-specific, though. On some problems, decreasing max_features
does not impair performance as much as it does here on covertype. I am
not sure whether the one-hot encoding is causing this.
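
To put rough numbers on the two settings, here is a minimal sketch using
only the standard library. It assumes the usual covertype layout of 54
columns (10 quantitative features plus 44 one-hot indicator columns),
which is not stated in this thread. resolve_max_features is a
hypothetical helper that mirrors the rule described above; it is not
scikit-learn's own code. The probability at the end is an idealised
figure for a single uniform draw without replacement, not a measurement
of what the tree builder actually does.

```python
import math

# Assumed covertype layout (not stated in this thread):
# 10 quantitative columns + 44 one-hot indicator columns.
N_QUANTITATIVE = 10
N_ONE_HOT = 44
N_FEATURES = N_QUANTITATIVE + N_ONE_HOT


def resolve_max_features(setting, n_features):
    """Hypothetical helper: number of candidate features per split.

    Mirrors the rule discussed above for classification:
    "auto"/"sqrt" -> sqrt(n_features), None -> all features.
    """
    if setting is None:
        return n_features
    if setting in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    raise ValueError("unsupported setting: %r" % (setting,))


k_auto = resolve_max_features("auto", N_FEATURES)
k_all = resolve_max_features(None, N_FEATURES)

# Idealised chance that a uniform draw of k_auto columns (without
# replacement) contains only one-hot indicator columns.
p_no_quantitative = math.comb(N_ONE_HOT, k_auto) / math.comb(N_FEATURES, k_auto)

print("candidates per split, auto:", k_auto)
print("candidates per split, None:", k_all)
print("P(draw has no quantitative column): %.3f" % p_no_quantitative)
```

If those assumptions hold, a noticeable share of the candidate sets would
consist of binary indicator columns only, which is one way the one-hot
encoding could interact with feature sampling. This is only a
back-of-the-envelope check; rerunning the benchmark with more trees would
be the real test.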

Best,

Gilles

On 27 March 2012 13:38, Peter Prettenhofer <[email protected]> wrote:
> Interesting - covtype involves a number of categorical attributes
> which are represented via a one-hot encoding - do you think that such
> a representation has a significant effect on feature sampling and thus
> the performance of random forests?
>
> 2012/3/27 Gilles Louppe <[email protected]>:
>> Hi,
>>
>> I am running the tests again, but indeed I think the difference in the
>> results comes from the fact that max_features=sqrt(n_features) is now
>> the default, whereas it was max_features=n_features before.
>>
>> Gilles
>>
>> On 27 March 2012 11:56, Paolo Losi <[email protected]> wrote:
>>> Thanks Peter,
>>>
>>> On Tue, Mar 27, 2012 at 11:34 AM, Peter Prettenhofer
>>> <[email protected]> wrote:
>>>>
>>>> Paolo,
>>>>
>>>> I noticed that too - maybe @glouppe can comment on this - I think the
>>>> reason was a change in the ``n_features`` heuristic but I might be
>>>> mistaken.
>>>
>>>
>>> Gilles, can you take a quick look at it? If it's not anything obvious, just
>>> ping back and I'll try to git bisect the issue...
>>>
>>>>
>>>> Concerning the GaussianNB - there's a PR [1] addressing a critical bug
>>>> in the estimator - it should be merged ASAP.
>>>
>>>
>>> Thanks. I've commented on the PR (the performance regression seems
>>> not to be connected with the PR)
>>>
>>>>
>>>> Furthermore, test time is
>>>> quite low - this might be due to memory layout issues - SGDClassifier
>>>> converts ``coef_`` to fortran-style for increased test-time
>>>> performance.
>>>
>>>
>>> Clear.
>>>
>>> Thanks again
>>>
>>> Paolo
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> This SF email is sponsosred by:
>>> Try Windows Azure free for 90 days Click Here
>>> http://p.sf.net/sfu/sfd2d-msazure
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>
>
>
> --
> Peter Prettenhofer

