On Mon, Jan 02, 2012 at 04:24:50PM +0100, Andreas wrote:
> 1)
> The narrative docs say that max_features=n_features is a good value for
> RandomForests.
> As far as I know, Breiman 2001 suggests max_features =
> log_2(n_features). I also
> saw a claim that Breiman 2001 suggests max_features = sqrt(n_features) but I
> couldn't find that in the paper.
> I just tried "digits" and max_features = log_2(n_features) works better than
> max_features = n_features. Of course that is definitely no conclusive
> evidence ;)
> Is there any reference that says max_features = n_features is good?
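
A comparison like the one described in the quoted message can be sketched as follows. This is only a minimal sketch, assuming a recent scikit-learn in which cross_val_score lives in sklearn.model_selection. The helper names candidate_max_features and compare_on_digits, the rounding down of sqrt and log_2, and the settings n_estimators=50 and cv=5 are illustrative assumptions, not taken from the thread or from Breiman's paper.

```python
import math


def candidate_max_features(n_features):
    """Return the three max_features settings discussed in the thread.

    Rounding down (and clamping to at least 1) is an assumption made
    for this sketch; the thread does not say how to round.
    """
    return {
        "n_features": n_features,
        "sqrt(n_features)": max(1, int(math.sqrt(n_features))),
        "log2(n_features)": max(1, int(math.log2(n_features))),
    }


def compare_on_digits(n_estimators=50, cv=5, random_state=0):
    """Cross-validate a random forest on "digits" for each setting.

    Returns a dict mapping the setting's name to its mean CV accuracy.
    """
    # Imported here so the helper above stays usable without scikit-learn.
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_digits(return_X_y=True)
    scores = {}
    for name, k in candidate_max_features(X.shape[1]).items():
        clf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_features=k,
            random_state=random_state,
        )
        scores[name] = cross_val_score(clf, X, y, cv=cv).mean()
    return scores


if __name__ == "__main__":
    for name, score in compare_on_digits().items():
        print(f"max_features = {name}: mean CV accuracy {score:.3f}")
```

As the quoted message already notes, a single run on one small dataset is not conclusive; repeating it over several seeds and datasets would be needed before drawing any rule of thumb.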

Actually, I believe consistency can be shown for random forests grown greedily (as they are in the standard implementations) if there are many samples per leaf:
http://jmlr.csail.mit.edu/papers/volume9/biau08a/biau08a.pdf, theorem 9: the number of leaves k goes as k = o(sqrt(n/log(n))), with n the number of samples.

For me, this makes sense intuitively: overfitting is prevented by some sort of averaging. This averaging works better if each leaf has more than one sample.

Now for better rules of thumb, I have no references :).

Thanks for the discussion, Andreas and Gilles. Having different people hammering on the code and the docs definitely helps make it accessible to everybody.

Gael

PS: happy new year.

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
