Hi everybody.
Recently, I started working with the RandomForest modules and there are a
couple of things that I noticed that I would like to change.
So this in particular goes out to @glouppe, who is the expert in the
field :)

1)
The narrative docs say that max_features=n_features is a good value for
RandomForests.
As far as I know, Breiman 2001 suggests max_features = log_2(n_features). I
also saw a claim that Breiman 2001 suggests max_features = sqrt(n_features),
but I couldn't find that in the paper.
I just tried "digits", and max_features = log_2(n_features) works better than
max_features = n_features. Of course, that is by no means conclusive
evidence ;)
Is there any reference that says max_features = n_features is good?
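For reference, this is roughly the comparison I ran, as a rough sketch only
(I'm assuming the digits dataset via load_digits and plain cross-validation;
the number of trees and CV folds are arbitrary, and exact module paths may
differ between versions):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
n_features = X.shape[1]  # 64 for digits

# Compare max_features = n_features, sqrt(n_features), log_2(n_features)
for max_features in [n_features,
                     int(np.sqrt(n_features)),
                     int(np.log2(n_features))]:
    clf = RandomForestClassifier(n_estimators=100,
                                 max_features=max_features,
                                 random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(max_features, scores.mean())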

Also, I think this default value contradicts the beginning of the
narrative docs a bit, since they claim: "In addition, when splitting a node
during the construction of the tree, the split that is chosen is no longer
the best split among all features. Instead, the split that is picked is the
best split among a random subset of the features."
Later, a recommendation to use max_features = n_features is made, but
no connection to that explanation is given.

2)
I noticed max_depth defaults to 10 in RandomForests, while the narrative
docs say that max_depth = None yields the best results. Is the default value
chosen because "None" might take too long?

3)
In the RandomForest docs, it's not clear to me which parameters are
parameters of the ensemble and which are parameters of the base estimator.
I think that should be made more explicit.

4) Understanding the parameter "min_density" took me some time,
in particular because I didn't see that it is a parameter of the
base estimator, not the ensemble. I think the docstring should start with
"This parameter trades runtime against the memory requirements of the
base decision tree." or something similar.

5) I think an explanation of "bootstrap" should go in the docs.
The docstring just states "Whether bootstrap samples are used when
building trees."
I don't think this is very helpful, since "bootstrap" is quite hard for an
outsider to look up.
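For example, something along these lines might be worth spelling out in the
docs (a plain numpy sketch of what a bootstrap sample is, not the actual
forest internals):

import numpy as np

# A bootstrap sample draws n_samples indices *with replacement* from the
# original training set, so each tree sees some samples several times and
# misses others entirely (about 37% are left out on average).
rng = np.random.RandomState(0)
n_samples = 10
indices = rng.randint(0, n_samples, n_samples)
print(indices)                  # the sample indices one tree would train on
print(np.unique(indices).size)  # number of distinct samples actually seen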

6) As far as I can see, it is possible to set "bootstrap" to False and
still have max_features = n_features.
This would build n_estimators identical estimators, right?
I think this combination should somehow be excluded.
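A quick way to check that claim (just a sketch; I'm using load_iris and
max_features=None for illustration, and trees could in principle still
differ through random tie-breaking between equally good splits):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# No bootstrap and all features considered at every split: each tree is
# grown on exactly the same data in essentially the same way.
clf = RandomForestClassifier(n_estimators=5, bootstrap=False,
                             max_features=None).fit(X, y)
preds = np.array([tree.predict(X) for tree in clf.estimators_])
print(np.all(preds == preds[0]))  # all trees should agree, up to ties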


Minor remarks that I'll fix if no-one objects:

- All forest classifiers should list the corresponding tree classes in their
"See also" section


Answers / comments welcome :)

Cheers,
Andy

