On 01/02/2012 06:58 PM, Gilles Louppe wrote:
>>> The narrative docs say that max_features=n_features is a good value
>>> for RandomForests.
>>> As far as I know, Breiman 2001 suggests max_features = log_2(n_features).
>>> I also saw a claim that Breiman 2001 suggests max_features = sqrt(n_features)
>>> but I couldn't find that in the paper.
>>> I just tried "digits" and max_features = log_2(n_features) works better
>>> than max_features = n_features. Of course that is definitely no
>>> conclusive evidence ;)
>>> Is there any reference that says max_features = n_features is good?
>>>
>>> Also, I think this default value contradicts the beginning of the
>>> narrative docs a bit, since that claims "In addition, when splitting a
>>> node during the construction of the tree, the split that is chosen is
>>> no longer the best split among all features. Instead, the split that
>>> is picked is the best split among a random subset of the features."
>>> Later, a recommendation on using max_features = n_features is made,
>>> but no connection to the explanation above is given.
>>
>> Short answer: the optimal value of max_features is problem-specific.
>>
>> In [1], it was found experimentally that max_features=sqrt(n_features)
>> was working well for classification problems, and
>> max_features=n_features for regression problems. This is at least the
>> case for extra-trees. For random forests, I am no longer sure, I will
>> check with my advisor.
>
> Back to you.
>
> In the random forest manual [2], it is recommended to use
> max_features=sqrt(n_features), with some warnings though:
>
> "mtry0 = the number of variables to split on at each node. Default is
> the square root of mdim. ATTENTION! DO NOT USE THE DEFAULT VALUES OF
> MTRY0 IF YOU WANT TO OPTIMIZE THE PERFORMANCE OF RANDOM FORESTS. TRY
> DIFFERENT VALUES-GROW 20-30 TREES, AND SELECT THE VALUE OF MTRY THAT
> GIVES THE SMALLEST OOB ERROR RATE."
>
> [2]: http://oz.berkeley.edu/users/breiman/RandomForests/cc_manual.htm
>
> I don't know why I had in mind that RFs should have
> max_features=n_features by default. My bad.
>
> My advisor says that log2 was indeed recommended at first in Breiman's
> paper, but sqrt was later preferred by Breiman, as [2] indeed indicates.
>
> What I suggest is to add a string value max_features="auto" such that
> max_features=sqrt(n_features) on classification problems and
> max_features=n_features on regression. In the same way, we could add
> max_features="sqrt" or max_features="log2" and let the user decide.

Thanks for checking. I didn't know about the random forest manual. Will
check that one.

I am +1 about having an "auto" keyword with a square-root default and
user-specified values otherwise.
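For what it's worth, here is a quick sketch of the kind of mtry tuning the
manual quoted above describes: grow ~30 trees per candidate value of
max_features on "digits" and keep the one with the best out-of-bag score.
This is written against the scikit-learn API I have installed locally
(RandomForestClassifier with oob_score, load_digits), so treat it as an
illustration of the procedure rather than anything from the current tree
module:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
n_features = X.shape[1]

# Candidate values for max_features (mtry), as discussed above.
candidates = {
    "sqrt(n_features)": int(np.sqrt(n_features)),
    "log2(n_features)": int(np.log2(n_features)),
    "n_features": n_features,
}

# Grow ~30 trees per candidate and compare out-of-bag accuracy,
# following the advice quoted from the random forest manual.
for name, m in candidates.items():
    clf = RandomForestClassifier(n_estimators=30, max_features=m,
                                 oob_score=True, random_state=0)
    clf.fit(X, y)
    print("max_features = %-18s -> OOB accuracy %.3f" % (name, clf.oob_score_))

On my machine sqrt and log2 come out ahead of n_features on digits, which
matches what Andreas observed, but of course the best value remains
problem-specific.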
> @amueller If you like, I can take care of all these changes (in that
> case, I'll do it tomorrow).

I thought I could do it today but didn't get to it. Too tired to do it
now. I haven't started, so feel free ;)

Cheers
Andy

ps: Happy new year also from me :)
