Hi Andy!

> 1)
> The narrative docs say that max_features=n_features is a good value for
> RandomForests.
> As far as I know, Breiman 2001 suggests max_features = log_2(n_features). I also
> saw a claim that Breiman 2001 suggests max_features = sqrt(n_features) but I
> couldn't find that in the paper.
> I just tried "digits" and max_features = log_2(n_features) works better than
> max_features = n_features. Of course that is definitely no conclusive
> evidence ;)
> Is there any reference that says max_features = n_features is good?
>
> Also, I think this default value contradicts the beginning of the
> narrative docs a bit,
> since that claims "In addition, when splitting a node during the
> construction of the tree, the split that is chosen is no longer the best
> split among all features. Instead, the split that is picked is the best
> split among a random subset of the features."
> Later, a recommendation on using max_features = n_features is made, but
> no connection to the explanation above is given.
Short answer: the optimal value of max_features is problem-specific. In [1],
it was found experimentally that max_features=sqrt(n_features) works well for
classification problems, and max_features=n_features for regression problems.
This is at least the case for extra-trees. For random forests, I am no longer
sure; I will check with my advisor.

[1] http://orbi.ulg.ac.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf

Anyway, I agree that this paragraph may be confusing with respect to the
default values in the code and would need some changes.

> 2)
> I noticed max_depth defaults to 10 in RandomForests, while the narrative
> docs say that max_depth = None yields best results. Is the default value
> chosen because "None" might take too long?

This was set to 10 because in our first implementations it may indeed have
taken too long; it has not been changed since. I nearly always use
max_depth=None in practice and instead use min_split to control the depth of
the tree. I agree we should make max_depth=None the default.

> 3)
> In the RandomForest docs, it's not clear to me from the documentation which
> parameters are parameters of the ensemble and which are parameters of the
> base estimator. I think that should be made more explicit.

Agreed, we should make that more explicit.

> 4) Understanding the parameter "min_density" took me some time,
> in particular because I didn't see that it was a parameter of the
> base estimator, not the ensemble. I think the docstring should start with
> "This parameter trades runtime against the memory requirements of the
> base decision tree." or similar.

Agreed.

> 5) I think an explanation of "bootstrap" should go in the docs.
> The docstring just states "Whether bootstrap samples are used when
> building trees."
> I don't think this is very helpful, since "bootstrap" is quite hard to
> look up for an outsider.

Okay, we should make that more explicit. I didn't realize it was obscure.
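For what it's worth, the digits comparison from (1) is easy to reproduce. Here is a rough sketch against the current scikit-learn API (exact scores depend on the seed and the version, so treat it as a sanity check rather than evidence either way):

```python
# Sketch: compare max_features settings on digits with 5-fold CV.
# Scores vary with the seed and scikit-learn version; illustrative only.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

scores = {}
for max_features in ("sqrt", "log2", None):  # None means all n_features
    forest = RandomForestClassifier(
        n_estimators=50,
        max_features=max_features,
        max_depth=None,  # grow the trees fully, as discussed in (2)
        random_state=0,
    )
    scores[max_features] = cross_val_score(forest, X, y, cv=5).mean()

for max_features, score in scores.items():
    print(f"max_features={max_features!r}: {score:.3f}")
```

On a run like this, sqrt and log2 tend to come out close to each other, which at least does not contradict Andy's observation that the subsampled settings beat max_features=n_features on digits.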
> 6) As far as I can see, it is possible to set "bootstrap" to 'False' and
> still have max_features = n_features.
> This would build n_estimator estimators that are identical, right?
> I think this option should somehow be excluded.

Using random forests, yes, they would be identical. They wouldn't be for
extra-trees.

> Minor remarks that I'll fix if no-one objects:
>
> - All Forest classifiers should have Trees in the "see also" section

Agreed.

> Answers / comments welcome :)
>
> Cheers,
> Andy

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
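To make (6) concrete, here is a minimal sketch (using the current scikit-learn API; iris is just a convenient stand-in) showing that with bootstrap=False and all features available at each split, the forest degenerates into copies of essentially one tree:

```python
# Sketch: with bootstrap=False each tree trains on the full training set,
# and with max_features=None every split considers every feature, so the
# ensemble is n_estimators copies of (essentially) the same tree.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=5,
    bootstrap=False,    # no resampling: every tree sees all of X
    max_features=None,  # consider every feature at every split
    random_state=0,
).fit(X, y)

# Fully grown trees fit on identical data make identical predictions
# (up to tie-breaking between equally good splits), so averaging over
# the ensemble buys nothing here.
preds = [tree.predict(X) for tree in forest.estimators_]
assert all((p == preds[0]).all() for p in preds)
```

With extra-trees the same settings do not collapse like this, because the candidate cut-points themselves are drawn at random per tree.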
