Hi everybody. Recently, I started working with the RandomForest modules and there are a couple of things I noticed that I would like to change. So this particularly goes out to @glouppe, who is the expert in the field :)
1) The narrative docs say that max_features=n_features is a good value for RandomForests. As far as I know, Breiman 2001 suggests max_features = log_2(n_features). I also saw a claim that Breiman 2001 suggests max_features = sqrt(n_features), but I couldn't find that in the paper. I just tried "digits" and max_features = log_2(n_features) works better than max_features = n_features (a rough sketch of the comparison is below). Of course that is by no means conclusive evidence ;) Is there any reference that says max_features = n_features is good? Also, I think this default value contradicts the beginning of the narrative docs a bit, since they claim: "In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features." Later, a recommendation to use max_features = n_features is made, but no connection to the explanation above is given.

2) I noticed that max_depth defaults to 10 in RandomForests, while the narrative docs say that max_depth = None yields the best results. Was the default value chosen because "None" might take too long?

3) In the RandomForest docs, it's not clear to me which parameters belong to the ensemble and which belong to the base estimator. I think that should be made more explicit.

4) Understanding the parameter "min_density" took me some time, in particular because I didn't see that it is a parameter of the base estimator, not of the ensemble. I think the docstring should start with "This parameter trades runtime against the memory requirements of the base decision tree." or something similar.

5) I think an explanation of "bootstrap" should go in the docs. The docstring just states "Whether bootstrap samples are used when building trees." I don't think this is very helpful, since "bootstrap" is quite hard to look up for an outsider.

6) As far as I can see, it is possible to set bootstrap=False and still have max_features = n_features. This would build n_estimators estimators that are identical, right? I think this combination should somehow be excluded (a quick check is in the second sketch below).

Minor remark that I'll fix if no-one objects:
- All forest classifiers should have the trees in the "see also" section.

Answers / comments welcome :)

Cheers,
Andy
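
P.S. In case it helps, here is roughly what I ran for point 1. This is just a minimal sketch, not a careful benchmark; it assumes the usual digits loader and RandomForestClassifier names, and the cross_val_score helper (whose import path differs between scikit-learn versions):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

    digits = load_digits()
    X, y = digits.data, digits.target
    n_features = X.shape[1]

    # Compare "all features" against log2 and sqrt of n_features.
    for max_features in (n_features, int(np.log2(n_features)), int(np.sqrt(n_features))):
        clf = RandomForestClassifier(n_estimators=100, max_features=max_features,
                                     random_state=0)
        scores = cross_val_score(clf, X, y, cv=5)
        print("max_features=%d: %.3f" % (max_features, scores.mean()))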
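
P.P.S. And a quick way to check point 6, again only a sketch and assuming the fitted forest exposes its trees via the estimators_ attribute: with bootstrap=False and max_features=n_features every tree sees the same samples and the same candidate splits, so the individual estimators should end up making the same predictions, up to random tie-breaking between equally good splits.

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier

    digits = load_digits()
    X, y = digits.data, digits.target

    forest = RandomForestClassifier(n_estimators=10, bootstrap=False,
                                    max_features=X.shape[1], random_state=0)
    forest.fit(X, y)

    # Compare every tree's predictions against the first tree's.
    first = forest.estimators_[0].predict(X)
    print(all((tree.predict(X) == first).all() for tree in forest.estimators_))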
