Howdy,

This documentation seems to be split between the RandomForestClassifier
<http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier>
documentation, which discusses sampling with replacement, and the Ensemble
<http://scikit-learn.org/stable/modules/ensemble.html> documentation, which
discusses the splits. I agree that it could be made more explicit. If you
want to build off of those to make sure that both discuss both features, I
would be happy to review it.
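For reference, a minimal NumPy sketch of the sampling-with-replacement step
discussed there and in the thread below (illustrative only, not
scikit-learn's internal code; the sample size n = 10 is hypothetical):

    import numpy as np

    # Bootstrap resampling as applied per tree: draw n indices WITH
    # replacement from the n training samples (n = 10 for illustration).
    rng = np.random.RandomState(0)
    n = 10
    indices = rng.randint(0, n, size=n)

    # Counting how often each sample was drawn gives its effective weight:
    # some samples get weight > 1, others weight 0 (left out of this tree).
    weights = np.bincount(indices, minlength=n)
    print(indices)   # e.g. [5 0 3 3 7 9 3 5 2 4]
    print(weights)   # e.g. [1 0 1 3 1 2 0 1 0 1]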
Jacob

On Mon, Jun 5, 2017 at 10:02 PM, Brown J.B. <jbbr...@kuhp.kyoto-u.ac.jp> wrote:
> Dear Jacob,
>
> Thank you for this clarification. It is a great help in interpreting the
> (good) results that we are obtaining for computational chemogenomics, and
> it also helps in deciding directions for future studies.
>
> Perhaps, then, the random forest documentation (description web page)
> could be updated to reflect our discussion, in that it might help others
> who have the same questions of interpretation.
>
> Perhaps we can add the following (with notation symbols corrected to
> match sklearn standards):
> ----------
> In general, for a modeling problem with N training instances each having
> F features, a random forest of T trees operates by building T decision
> trees such that each tree is provided a subsample of the N instances and
> the F features for those subsampled instances.
>
> When bootstrapping is applied, the instance subsampling can choose the
> same instance multiple times, in which case such an instance will have an
> elevated weight.
> When bootstrapping is not applied, the entire training set is provided to
> the tree-building algorithm.
>
> Each tree is built by considering at most a specified number of
> informative features at each decision node, such that features with no
> variance are excluded from the candidates for a split and do not count
> toward the number of informative features.
> Splitting on the informative features can occur as many times as
> necessary, unless a maximum depth is specified in the constructor.
> Note that an informative feature can be re-used to form a decision
> criterion at more than one node in the decision tree.
> ----------
>
> Adjustments welcome.
>
> Many thanks again!
> J.B.
>
> 2017-06-06 2:54 GMT+09:00 Jacob Schreiber <jmschreibe...@gmail.com>:
>> Howdy
>>
>> When doing bootstrapping, n samples are selected from the dataset WITH
>> replacement, where n is the number of samples in the dataset. This leads
>> to situations where some samples have a weight > 1 and others have a
>> weight of 0. This is done separately for each tree.
>>
>> When selecting the number of features, this should be considered more
>> like `max_informative_features`. Essentially, if a tree considers
>> splitting on a feature that is constant, that won't count against the
>> `max_features` threshold that is set. This helps guard against
>> situations where many uninformative trees are built because the dataset
>> is full of uninformative features. You can see this in the code here:
>> https://github.com/scikit-learn/scikit-learn/blob/14031f65d144e3966113d3daec836e443c6d7a5b/sklearn/tree/_splitter.pyx#L361
>> This is done on a *per-split* basis, meaning that a tree can end up
>> considering more than `max_features` features in total.
>>
>> In your example, it is not that there would be at most 20 splits in a
>> tree, it is that at each split only 20 informative features would be
>> considered. You can split on a feature multiple times (consider the
>> example where you have one feature and x < 0 is class 0, 0 <= x <= 10 is
>> class 1, and x > 10 is class 0 again).
>>
>> Let me know if you have any other questions!
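A minimal sketch of the one-feature example above, assuming a recent
scikit-learn where sklearn.tree.export_text is available (older versions
can inspect the fitted tree's tree_.feature array instead):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # One feature: x < 0 is class 0, 0 <= x <= 10 is class 1,
    # and x > 10 is class 0 again.
    X = np.linspace(-5.0, 15.0, 200).reshape(-1, 1)
    y = ((X[:, 0] >= 0) & (X[:, 0] <= 10)).astype(int)

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)

    # The printed tree splits on feature 0 near 0 and again near 10,
    # i.e. the same feature forms the decision criterion at more than
    # one node of the tree.
    print(export_text(clf))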
>> On Mon, Jun 5, 2017 at 7:46 AM, Brown J.B. <jbbr...@kuhp.kyoto-u.ac.jp> wrote:
>>> Dear community,
>>>
>>> This is a question regarding how to interpret the documentation and
>>> semantics of the random forest constructors.
>>>
>>> In forest.py (of version 0.17, which I am still using), the
>>> documentation regarding the number of features to consider states, on
>>> lines 742-745 of the source code, that the search may effectively
>>> inspect more than `max_features` features when determining which
>>> features to pick from in order to split a node.
>>> It also states that it is tree-specific.
>>>
>>> Am I correct in:
>>>
>>> Interpretation #1 - For bootstrap=True, sampling with replacement
>>> occurs over the number of training instances available, meaning that
>>> the subsample presented to a particular tree will have some probability
>>> of containing duplicates and therefore not the full input training set,
>>> but for bootstrap=False, the entire dataset will be presented to each
>>> tree?
>>>
>>> Interpretation #2 - In particular, given the way I interpret the
>>> documentation stating that "The sub-sample size is always the same as
>>> the original input sample size...", it seems to me that bootstrap=False
>>> then provides the entire training dataset to each decision tree, and it
>>> is a matter of which feature is randomly selected first from the
>>> available features that determines what the tree will become.
>>> That would suggest that, if bootstrap=False and the number of trees is
>>> high but the feature dimensionality is very low, there is a high
>>> possibility that multiple copies of the same tree will emerge from the
>>> forest.
>>>
>>> Interpretation #3 - The feature subset is not subsampled per tree, but
>>> rather all features are presented for the subsampled training data
>>> provided to a tree? For example, if the dimensionality is 400 on a
>>> 6000-instance training dataset that has been randomly subsampled (with
>>> bootstrap=True) to yield 4700 unique training samples, then the tree
>>> builder will consider all 400 dimensions/features with respect to the
>>> 4700 samples, picking at most `max_features` features (out of 400) for
>>> building splits in the tree? So by default (sqrt/auto), there would be
>>> at most 20 splits in the tree?
>>>
>>> Confirmations, denials, and corrections to my interpretations are
>>> _highly_ welcome.
>>>
>>> As always, my great thanks to the community.
>>>
>>> With kind regards,
>>> J.B. Brown
>>> Kyoto University Graduate School of Medicine
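Regarding Interpretation #3, a minimal sketch on synthetic data of the same
shape (6000 samples, 400 features; assuming a current scikit-learn, where
each fitted sub-estimator exposes max_features_ and tree_.node_count):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic data shaped like the example above.
    rng = np.random.RandomState(0)
    X = rng.randn(6000, 400)
    y = rng.randint(0, 2, size=6000)

    forest = RandomForestClassifier(n_estimators=5, max_features="sqrt",
                                    bootstrap=True, random_state=0).fit(X, y)

    tree = forest.estimators_[0]
    print(tree.max_features_)     # 20: candidate features per split (sqrt(400))
    print(tree.tree_.node_count)  # typically far more than 20 nodes overall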
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn