Dear community,

This is a question about how to interpret the documentation and semantics of the random forest constructors.
In forest.py (version 0.17, which I am still using), the documentation for the number of features to consider states, on lines 742-745 of the source code, that the search for a split may effectively inspect more than `max_features` features when determining which features to pick from in order to split a node. It also states that the parameter is tree specific. Am I correct in the following interpretations?

Interpretation #1 - With bootstrap=True, the training instances are sampled with replacement, as many times as there are training instances available, so the subsample presented to a particular tree will most likely contain duplicates and therefore will not cover the full input training set. With bootstrap=False, the entire dataset is presented to each tree.

Interpretation #2 - In particular, given the documentation's statement that "The sub-sample size is always the same as the original input sample size...", it seems to me that bootstrap=False provides the entire training dataset to each decision tree, and that what the tree becomes is determined by which features happen to be randomly selected first from those available. That would suggest that if bootstrap=False, the number of trees is high, and the feature dimensionality is very low, then there is a high probability that multiple copies of the same tree will emerge from the forest.

Interpretation #3 - The features are not subsampled per tree; rather, all features are presented together with the subsampled training data provided to a tree. For example, if the dimensionality is 400 and a 6000-instance training dataset has been randomly subsampled (with bootstrap=True) to yield 4700 unique training samples, will the tree builder consider all 400 dimensions/features with respect to the 4700 samples, picking at most `max_features` features (out of 400) for building splits in the tree? If so, by default (sqrt/auto), would there be at most 20 splits in the tree?
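The arithmetic behind the example in Interpretation #3 can be checked with a minimal sketch. It uses only the Python standard library and does not call scikit-learn, so it says nothing about how forest.py is actually implemented; the helper names `bootstrap_indices` and `unique_fraction` are hypothetical and exist only for this illustration. It draws one bootstrap sample of 6000 row indices, reports the fraction of distinct rows, and computes the square root of 400.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def unique_fraction(n_samples, seed=0):
    """Fraction of distinct rows in one bootstrap draw of size n_samples."""
    rng = random.Random(seed)
    return len(set(bootstrap_indices(n_samples, rng))) / n_samples


# Sizes taken from the example above (illustrative only).
n_samples, n_features = 6000, 400

# Expected fraction of distinct rows: 1 - (1 - 1/n)^n, close to 1 - 1/e.
expected = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples
observed = unique_fraction(n_samples)

# Value given by the square-root rule for 400 features.
sqrt_features = int(math.sqrt(n_features))

print(f"expected unique fraction: {expected:.3f}")
print(f"observed unique fraction: {observed:.3f}")
print(f"sqrt rule for {n_features} features: {sqrt_features}")
```

Under these assumptions, a bootstrap draw of 6000 rows typically contains about 63% distinct rows, which is roughly 3800 rather than 4700, so the 4700 figure above is best read as purely illustrative. The value 20 is simply the square root of 400; whether it bounds the features examined per split or the number of splits per tree is exactly what is being asked.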
Confirmations, denials, and corrections to my interpretations are _highly_ welcome. As always, my great thanks to the community.

With kind regards,
J.B. Brown
Kyoto University Graduate School of Medicine