Thanks, Javier. However, max_features is n_features by default, so no random subselection of features should happen in that case. But if you execute something like
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)

for i in range(20):
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))

you will find that the tree produces different results if you don't fix the random seed. Related to what you said about the random feature selection when max_features is not n_features, I suspect that there is generally some sorting of the features going on, and that the different trees are then due to tie-breaking when two features yield the same information gain?

Best,
Sebastian

> On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:
>
> Hi Sebastian,
>
> I think the random state is used to select the features that go into each
> split (look at the `max_features` parameter).
>
> Cheers,
> Javier
>
> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka
> <m...@sebastianraschka.com> wrote:
> Hi all,
>
> when I was implementing a bagging classifier based on scikit-learn's
> DecisionTreeClassifier, I noticed that the results were not deterministic
> and found that this was due to the random_state in the
> DecisionTreeClassifier (which is set to None by default).
>
> I am wondering what exactly this random state is used for? I can imagine it
> being used for resolving ties if the information gain for multiple features
> is the same, or it could be that the splits of continuous features are
> different? (I thought the heuristic is to sort the feature values and to
> consider split points between adjacent values that belong to examples with
> different class labels -- but is there maybe some random subselection
> involved?)
>
> If someone knows more about where the random_state is used, I'd be happy
> to hear it :)
>
> Also, we could then maybe add the info to the DecisionTreeClassifier's
> docstring, which is currently a bit too generic to be useful, I think:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>
> random_state : int, RandomState instance or None, optional (default=None)
>     If int, random_state is the seed used by the random number generator;
>     If RandomState instance, random_state is the random number generator;
>     If None, the random number generator is the RandomState instance used
>     by `np.random`.
>
> Best,
> Sebastian
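
P.S. One way to probe the tie-breaking hypothesis (just a rough sketch on my end, reusing the iris setup from above) would be to fix random_state and look at which feature each node splits on via tree_.feature. If the non-determinism comes from random tie-breaking, both the scores and the chosen split features should stop varying once the seed is fixed:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=123, stratify=iris.target)

for i in range(5):
    # with a fixed seed, repeated fits should be identical
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # tree_.feature holds the feature index used at each node (-2 marks leaves)
    print(tree.score(X_test, y_test), tree.tree_.feature[:5])

With random_state=None instead, I'd expect the printed feature indices to differ across runs if ties are indeed broken at random.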