Hi Andy!

> 1)
> The narrative docs say that max_features=n_features is a good value for
> RandomForests.
> As far as I know, Breiman 2001 suggests max_features = log_2(n_features). I also
> saw a claim that Breiman 2001 suggests max_features = sqrt(n_features) but I
> couldn't find that in the paper.
> I just tried "digits" and max_features = log_2(n_features) works better than
> max_features = n_features. Of course that is definitely no conclusive
> evidence ;)
> Is there any reference that says max_features = n_features is good?
>
> Also, I think this default value contradicts the beginning of the
> narrative docs a bit,
> since that claims "In addition, when splitting a node during the
> construction of the tree, the split that is chosen is no longer the best
> split among all features. Instead, the split that is picked is the best
> split among a random subset of the features."
> Later, a recommendation on using max_features = n_features is made, but
> no connection to the explanation above is given.
Short answer: the optimal value of max_features is problem-specific. In [1],
it was found experimentally that max_features=sqrt(n_features) works well for
classification problems, and max_features=n_features for regression problems.
This is at least the case for extra-trees. For random forests, I am no longer
sure; I will check with my advisor.

[1] http://orbi.ulg.ac.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf

Anyway, I agree that this paragraph may be confusing with respect to the
default values in the code and would need some changes.

> 2)
> I noticed max_depth defaults to 10 in RandomForests, while the narrative
> docs say that max_depth = None yields best results. Is the default value
> chosen because "None" might take too long?

This was set to 10 because in our first implementations it may indeed have
taken too long; it has not been changed since. I nearly always use
max_depth=None in practice and instead use min_split to control the depth of
the tree. I agree we should make max_depth=None the default.

> 3)
> In the RandomForest docs, it's not clear to me from the documentation which
> parameters are parameters of the ensemble and which are parameters of the
> base estimator. I think that should be made more explicit.

Agreed, we should make that more explicit.

> 4) Understanding the parameter "min_density" took me some time,
> in particular because I didn't see that it was a parameter of the
> base estimator, not the ensemble. I think the docstring should start with
> "This parameter trades runtime against the memory requirements of the
> base decision tree." or similar.

Agreed.

> 5) I think an explanation of "bootstrap" should go in the docs.
> The docstring just states "Whether bootstrap samples are used when
> building trees."
> I don't think this is very helpful, since "bootstrap" is quite hard to
> look up for an outsider.

Okay, we should make that more explicit. I didn't realize it was obscure.
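For what it's worth, the digits comparison from (1) is easy to reproduce. Here is a rough sketch against the current scikit-learn API (exact scores depend on the seed and the version, so treat it as a sanity check rather than evidence either way):

```python
# Sketch: compare max_features settings on digits with 5-fold CV.
# Scores vary with the seed and scikit-learn version; illustrative only.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

scores = {}
for max_features in ("sqrt", "log2", None):  # None means all n_features
    forest = RandomForestClassifier(
        n_estimators=50,
        max_features=max_features,
        max_depth=None,  # grow the trees fully, as discussed in (2)
        random_state=0,
    )
    scores[max_features] = cross_val_score(forest, X, y, cv=5).mean()

for max_features, score in scores.items():
    print(f"max_features={max_features!r}: {score:.3f}")
```

On a run like this, sqrt and log2 tend to come out close to each other, which at least does not contradict Andy's observation that the subsampled settings beat max_features=n_features on digits.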
> 6) As far as I can see, it is possible to set "bootstrap" to 'False' and
> still have max_features = n_features.
> This would build n_estimator estimators that are identical, right?
> I think this option should somehow be excluded.

Using random forests, yes, they would be identical. They wouldn't be for
extra-trees.

> Minor remarks that I'll fix if no-one objects:
>
> - All Forest classifiers should have Trees in the "see also" section

Agreed.

> Answers / comments welcome :)
>
> Cheers,
> Andy

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
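To make (6) concrete, here is a minimal sketch (using the current scikit-learn API; iris is just a convenient stand-in) showing that with bootstrap=False and all features available at each split, the forest degenerates into copies of essentially one tree:

```python
# Sketch: with bootstrap=False each tree trains on the full training set,
# and with max_features=None every split considers every feature, so the
# ensemble is n_estimators copies of (essentially) the same tree.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=5,
    bootstrap=False,    # no resampling: every tree sees all of X
    max_features=None,  # consider every feature at every split
    random_state=0,
).fit(X, y)

# Fully grown trees fit on identical data make identical predictions
# (up to tie-breaking between equally good splits), so averaging over
# the ensemble buys nothing here.
preds = [tree.predict(X) for tree in forest.estimators_]
assert all((p == preds[0]).all() for p in preds)
```

With extra-trees the same settings do not collapse like this, because the candidate cut-points themselves are drawn at random per tree.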
