On 01/02/2012 06:58 PM, Gilles Louppe wrote:
>>> The narrative docs say that max_features=n_features is a good value
>>> for RandomForests.
>>> As far as I know, Breiman 2001 suggests max_features = log_2(n_features).
>>> I also saw a claim that Breiman 2001 suggests max_features = sqrt(n_features)
>>> but I couldn't find that in the paper.
>>> I just tried "digits" and max_features = log_2(n_features) works better
>>> than max_features = n_features. Of course that is definitely no
>>> conclusive evidence ;)
>>> Is there any reference that says max_features = n_features is good?
>>>
>>> Also, I think this default value contradicts the beginning of the
>>> narrative docs a bit, since that claims "In addition, when splitting a
>>> node during the construction of the tree, the split that is chosen is
>>> no longer the best split among all features. Instead, the split that
>>> is picked is the best split among a random subset of the features."
>>> Later, a recommendation on using max_features = n_features is made,
>>> but no connection to the explanation above is given.
>>
>> Short answer: the optimal value of max_features is problem-specific.
>>
>> In [1], it was found experimentally that max_features=sqrt(n_features)
>> was working well for classification problems, and
>> max_features=n_features for regression problems. This is at least the
>> case for extra-trees. For random forests, I am no longer sure, I will
>> check with my advisor.
>
> Back to you.
>
> In the random forest manual [2], it is recommended to use
> max_features=sqrt(n_features), with some warnings though:
>
> "mtry0 = the number of variables to split on at each node. Default is
> the square root of mdim. ATTENTION! DO NOT USE THE DEFAULT VALUES OF
> MTRY0 IF YOU WANT TO OPTIMIZE THE PERFORMANCE OF RANDOM FORESTS. TRY
> DIFFERENT VALUES-GROW 20-30 TREES, AND SELECT THE VALUE OF MTRY THAT
> GIVES THE SMALLEST OOB ERROR RATE."
>
> [2]: http://oz.berkeley.edu/users/breiman/RandomForests/cc_manual.htm
>
> I don't know why I had in mind that RFs should have
> max_features=n_features by default. My bad.
>
> My advisor says that log2 was indeed recommended at first in Breiman's
> paper, but sqrt was later preferred by Breiman, as [2] indeed indicates.
>
> What I suggest is to add a string value max_features="auto" such that
> max_features=sqrt(n_features) on classification problems and
> max_features=n_features on regression. In the same way, we could add
> max_features="sqrt" or max_features="log2" and let the user decide.

Thanks for checking. I didn't know about the random forest manual. Will
check that one.

I am +1 about having an "auto" keyword with a square-root default and
user-specified values otherwise.
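For what it's worth, here is a quick sketch of the kind of mtry tuning the
manual quoted above describes: grow ~30 trees per candidate value of
max_features on "digits" and keep the one with the best out-of-bag score.
This is written against the scikit-learn API I have installed locally
(RandomForestClassifier with oob_score, load_digits), so treat it as an
illustration of the procedure rather than anything from the current tree
module:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
n_features = X.shape[1]

# Candidate values for max_features (mtry), as discussed above.
candidates = {
    "sqrt(n_features)": int(np.sqrt(n_features)),
    "log2(n_features)": int(np.log2(n_features)),
    "n_features": n_features,
}

# Grow ~30 trees per candidate and compare out-of-bag accuracy,
# following the advice quoted from the random forest manual.
for name, m in candidates.items():
    clf = RandomForestClassifier(n_estimators=30, max_features=m,
                                 oob_score=True, random_state=0)
    clf.fit(X, y)
    print("max_features = %-18s -> OOB accuracy %.3f" % (name, clf.oob_score_))

On my machine sqrt and log2 come out ahead of n_features on digits, which
matches what Andreas observed, but of course the best value remains
problem-specific.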
> @amueller If you like, I can take care of all these changes (in that
> case, I'll do it tomorrow).

I thought I could do it today but didn't get to it. Too tired to do it
now. I haven't started, so feel free ;)

Cheers
Andy

ps: Happy new year also from me :)
