Re: [Scikit-learn-general] covertype benchmark and unexpected extra trees and random forest results

Peter Prettenhofer Tue, 27 Mar 2012 05:57:34 -0700

2012/3/27 Paolo Losi <[email protected]>:
> Gilles,
>
> thank you very much for having checked.
>
> If everyone agrees I'll:
>
> - uncomment extratrees and randomforest benchmark (@pprett is there
>   any valid reason to leave them out?)


No, absolutely not. I just forgot to uncomment them, thanks.

> - explicitly set max_features=None for RandomForest and ExtraTrees

+1
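
For reference, a minimal sketch of what the explicit setting could look like.
The 20 trees match what the benchmark uses according to this thread; the
`make_forests` helper and the `random_state` value are assumptions made up
for illustration, not the actual benchmark code.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier


def make_forests(n_estimators=20, random_state=0):
    """Build both ensembles with max_features=None (hypothetical helper)."""
    # max_features=None lets every split consider all features, which is
    # the behaviour the benchmark relied on before the default changed.
    common = dict(
        n_estimators=n_estimators,
        max_features=None,
        random_state=random_state,  # assumption: fixed seed for repeatability
    )
    return {
        "RandomForest": RandomForestClassifier(**common),
        "ExtraTrees": ExtraTreesClassifier(**common),
    }


forests = make_forests()
```

Both objects are then fitted and scored like any other estimator; only the
feature subsampling at each split differs from the default configuration.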

>
> Thanks again
>
> Paolo
>
> On Tue, Mar 27, 2012 at 2:13 PM, Gilles Louppe <[email protected]> wrote:
>>
>> Hi,
>>
>> Using max_features="auto" (default setting) indeed yields the results
>> that Paolo reports.
>>
>> When setting max_features=None (i.e., using all features as in our
>> earlier code), I got the following on my machine:
>>
>> Classifier    train-time  test-time  error-rate
>> RandomForest  778.1471s   1.2830s    0.0248
>> Extra-Trees   1325.2397s  1.3544s    0.0199
>>
>> which is consistent with what is mentioned in the doc.
>>
>> @pprett: Since max_features=sqrt(n_features) is now the default on
>> classification problems, the trees are usually more randomized and
>> hence have a higher bias. To compensate for that, more trees usually
>> need to be built, whereas we only use 20 trees in the benchmark (which
>> is low in my opinion). The effect of max_features is very
>> dataset-specific, though. On some problems, decreasing max_features
>> does not impair performance as much as it does here on covertype. I am
>> not sure whether the one-hot encoding is causing this.
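
To put rough numbers on the one-hot question, here is a back-of-the-envelope
sketch. It assumes covertype's usual layout of 54 columns (10 numeric, 4
wilderness-area indicators, 40 soil-type indicators) and assumes the candidate
features at each split are drawn uniformly without replacement. Neither
assumption is taken from the benchmark code, and the truncated square root is
only my reading of the heuristic, so treat this as illustrative.

```python
import math

# Assumed covertype layout (not stated in the thread itself).
N_FEATURES = 54
N_NUMERIC = 10
N_ONE_HOT = N_FEATURES - N_NUMERIC  # binary indicator columns

# sqrt(n_features) heuristic, truncated to an integer number of candidates.
k = max(1, int(math.sqrt(N_FEATURES)))

# Simplified model: k candidates sampled uniformly without replacement.
# Probability that a split sees no numeric feature at all (hypergeometric).
p_no_numeric = math.comb(N_ONE_HOT, k) / math.comb(N_FEATURES, k)

# Expected number of one-hot indicator columns among the k candidates.
expected_one_hot = k * N_ONE_HOT / N_FEATURES

print(f"candidates per split: {k} of {N_FEATURES}")
print(f"P(no numeric feature among candidates): {p_no_numeric:.3f}")
print(f"expected one-hot candidates per split: {expected_one_hot:.2f}")
```

Under this simplified model a noticeable share of splits would have only
binary indicator columns to choose from, which may be part of why covertype
reacts so strongly to a smaller max_features. It does not prove that the
encoding is the cause.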
>>
>> Best,
>>
>> Gilles
>>
>> On 27 March 2012 13:38, Peter Prettenhofer <[email protected]>
>> wrote:
>> > Interesting - covtype involves a number of categorical attributes
>> > which are represented via a one-hot encoding - do you think that such
>> > a representation has a significant effect on feature sampling and thus
>> > the performance of random forests?
>> >
>> > 2012/3/27 Gilles Louppe <[email protected]>:
>> >> Hi,
>> >>
>> >> I am running the tests again, but indeed I think the difference in the
>> >> results comes from the fact that max_features=sqrt(n_features) is now
>> >> the default, whereas it was max_features=n_features before.
>> >>
>> >> Gilles
>> >>
>> >> On 27 March 2012 11:56, Paolo Losi <[email protected]> wrote:
>> >>> Thanks Peter,
>> >>>
>> >>> On Tue, Mar 27, 2012 at 11:34 AM, Peter Prettenhofer
>> >>> <[email protected]> wrote:
>> >>>>
>> >>>> Paolo,
>> >>>>
>> >>>> I noticed that too - maybe @glouppe can comment on this - I think the
>> >>>> reason was a change in the ``n_features`` heuristic but I might be
>> >>>> mistaken.
>> >>>
>> >>>
>> >>> Gilles, can you take a quick look at it? If it's not anything obvious,
>> >>> just ping back and I'll try to git bisect the issue...
>> >>>
>> >>>>
>> >>>> Concerning the GaussianNB - there's a PR [1] addressing a critical bug
>> >>>> in the estimator - it should be merged ASAP.
>> >>>
>> >>>
>> >>> Thanks. I've commented on the PR (the performance regression seems
>> >>> not to be connected with the PR)
>> >>>
>> >>>>
>> >>>> Furthermore, test time is
>> >>>> quite low - this might be due to memory layout issues - SGDClassifier
>> >>>> converts ``coef_`` to Fortran-style for increased test-time
>> >>>> performance.
>> >>>
>> >>>
>> >>> Clear.
>> >>>
>> >>> Thanks again
>> >>>
>> >>> Paolo
>> >>>
>> >>>
>> >>>
>> >>> ------------------------------------------------------------------------------
>> >>> This SF email is sponsosred by:
>> >>> Try Windows Azure free for 90 days Click Here
>> >>> http://p.sf.net/sfu/sfd2d-msazure
>> >>> _______________________________________________
>> >>> Scikit-learn-general mailing list
>> >>> [email protected]
>> >>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>>
>> >
>> >
>> >
>> > --
>> > Peter Prettenhofer
>>
>>
>
>
>
>
> --
> Paolo Losi
> e-mail: [email protected]
> mob:   +39 348 7705261
>
> ENUAN Srl
> Via XX Settembre, 12 - 29100 Piacenza
>
>



-- 
Peter Prettenhofer
