The random_state is used in the splitters. In sklearn/tree/tree.py, the splitter is instantiated as:

    SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS

    splitter = self.splitter
    if not isinstance(self.splitter, Splitter):
        splitter = SPLITTERS[self.splitter](criterion,
                                            self.max_features_,
                                            min_samples_leaf,
                                            min_weight_leaf,
                                            random_state,
                                            self.presort)

where the splitter classes are defined as:

    DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
                       "random": _splitter.RandomSplitter}

    SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
                        "random": _splitter.RandomSparseSplitter}

Both the 'best' and 'random' splitters use the random state. DecisionTreeClassifier uses 'best' as the default `splitter` parameter. I am not sure how exactly this 'best' strategy is defined; the docs only say "Supported strategies are 'best' to choose the best split and 'random' to choose the best random split."
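One quick way to see that the seed flows through the splitter is a minimal sketch using only the public tree_ arrays of a fitted tree (this is an illustration on iris, nothing more):

    # Fixing random_state makes repeated fits reproducible, which is
    # consistent with the splitter excerpt above consuming the seed.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Same seed twice: identical split features and thresholds.
    t1 = DecisionTreeClassifier(random_state=0).fit(X, y)
    t2 = DecisionTreeClassifier(random_state=0).fit(X, y)
    assert np.array_equal(t1.tree_.feature, t2.tree_.feature)
    assert np.allclose(t1.tree_.threshold, t2.tree_.threshold)

    # A different seed may yield a different tree, even though
    # max_features defaults to n_features here.
    t3 = DecisionTreeClassifier(random_state=1).fit(X, y)
    print(np.array_equal(t1.tree_.feature, t3.tree_.feature))  # may be False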
On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański <nied...@gmail.com> wrote:

> Just a small side note that I've come across with Random Forests, which
> in the end form an ensemble of Decision Trees. I ran a thousand
> iterations of RFs on multi-label data and managed to get a 4-10
> percentage point difference in subset accuracy, depending on the data
> set, just as a random effect, while I've seen papers report differences
> of just a couple pp as statistically significant after a non-parametric
> rank test.
>
> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>
>> Good suggestion. The trees look different; i.e., there seems to be a
>> tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65.
>>
>> So, I suspect that the features are shuffled; let's call the result
>> X_shuffled. Then at some point max_features of them are selected, which
>> by default is X_shuffled[:, :n_features]. Based on that, if there's a
>> tie between the impurities of different features, it's probably
>> selecting whichever of the tied features comes first in the array.
>>
>> If this is true (I'd have to look into the code more deeply), I wonder
>> if it would be worthwhile to change the implementation such that the
>> shuffling only occurs if max_features < n_features, because this way we
>> could have deterministic behavior for the trees by default, which I'd
>> find more intuitive for plain decision trees, to be honest.
>>
>> Let me know what you all think.
>>
>> Best,
>> Sebastian
>>
>>> On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:
>>>
>>> Hmmm, that's weird...
>>>
>>> Have you tried to plot the trees (the decision rules) for trees with
>>> different seeds, and checked whether the gain for the first split is
>>> the same even if the split itself is different?
>>>
>>> I'd at least try that before diving into the source code...
>>>
>>> Cheers,
>>>
>>> --
>>> Julio
>>>
>>>> On Oct 28, 2018, at 2:24, Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>
>>>> Thanks, Javier,
>>>>
>>>> however, max_features is n_features by default. But if you execute
>>>> something like
>>>>
>>>> import numpy as np
>>>> from sklearn.datasets import load_iris
>>>> from sklearn.model_selection import train_test_split
>>>> from sklearn.tree import DecisionTreeClassifier
>>>>
>>>> iris = load_iris()
>>>> X, y = iris.data, iris.target
>>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
>>>>                                                     test_size=0.3,
>>>>                                                     random_state=123,
>>>>                                                     shuffle=True,
>>>>                                                     stratify=y)
>>>>
>>>> for i in range(20):
>>>>     tree = DecisionTreeClassifier()
>>>>     tree.fit(X_train, y_train)
>>>>     print(tree.score(X_test, y_test))
>>>>
>>>> you will find that the tree will produce different results if you
>>>> don't fix the random seed.
>>>>
>>>> I suspect, related to what you said about the random feature
>>>> selection if max_features is not n_features, that there is generally
>>>> some sorting of the features going on, and that the different trees
>>>> are then due to tie-breaking if two features have the same
>>>> information gain?
>>>>
>>>> Best,
>>>> Sebastian
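Sebastian's tie-breaking suspicion above can be probed directly: duplicate one column so that two features yield exactly the same impurity reduction by construction, then watch which one the root split lands on across seeds (the duplicated-column trick is just an illustration for this thread, not how scikit-learn itself tests anything):

    # With an exact duplicate column, the candidate split gains tie by
    # construction, so any seed-dependence of the chosen feature exposes
    # the tie-breaking behavior.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_dup = np.hstack([X, X[:, [2]]])  # column 4 is an exact copy of column 2

    for seed in range(10):
        tree = DecisionTreeClassifier(random_state=seed).fit(X_dup, y)
        print(seed, "-> root splits on feature", tree.tree_.feature[0])

If the root feature varies between column 2 and its copy (column 4) as the seed changes, ties are indeed broken by the random order in which the splitter visits the features.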
>>>>
>>>>> On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:
>>>>>
>>>>> Hi Sebastian,
>>>>>
>>>>> I think the random state is used to select the features that go into
>>>>> each split (look at the `max_features` parameter).
>>>>>
>>>>> Cheers,
>>>>> Javier
>>>>>
>>>>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> when I was implementing a bagging classifier based on scikit-learn's
>>>>> DecisionTreeClassifier, I noticed that the results were not
>>>>> deterministic and found that this was due to the random_state in the
>>>>> DecisionTreeClassifier (which is set to None by default).
>>>>>
>>>>> I am wondering what exactly this random state is used for? I can
>>>>> imagine it being used for resolving ties if the information gain for
>>>>> multiple features is the same, or it could be that the split points
>>>>> of continuous features come out differently? (I thought the heuristic
>>>>> is to sort the feature values and to consider candidate splits
>>>>> between adjacent values that are associated with examples of
>>>>> different class labels -- but is there maybe some random subselection
>>>>> involved?)
>>>>>
>>>>> If someone knows more about where the random_state is used, I'd be
>>>>> happy to hear it :)
>>>>>
>>>>> Also, we could then maybe add the info to the DecisionTreeClassifier's
>>>>> docstring, which is currently a bit too generic to be useful, I think:
>>>>>
>>>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>>>>>
>>>>>     random_state : int, RandomState instance or None, optional (default=None)
>>>>>         If int, random_state is the seed used by the random number generator;
>>>>>         If RandomState instance, random_state is the random number generator;
>>>>>         If None, the random number generator is the RandomState instance used
>>>>>         by `np.random`.
>>>>>
>>>>> Best,
>>>>> Sebastian
>
> --
> Piotr Szymański
> nied...@gmail.com
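As a side note on Piotr's observation, the size of the seed-only effect is easy to quantify for a given dataset. Here is a minimal sketch on iris (his numbers were for subset accuracy on multi-label data, so the spread here will differ):

    # Refit the same forest under different seeds and report the spread
    # of test accuracy attributable to the seed alone.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123, stratify=y)

    scores = []
    for seed in range(30):
        rf = RandomForestClassifier(n_estimators=100, random_state=seed)
        scores.append(rf.fit(X_train, y_train).score(X_test, y_test))

    print("min=%.3f max=%.3f spread=%.3f"
          % (min(scores), max(scores), max(scores) - min(scores)))

Seed variance of this kind is worth reporting alongside any claimed accuracy difference between methods.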
--
Fernando Marcos Wittmann
MS Student - Energy Systems Dept.
School of Electrical and Computer Engineering, FEEC
University of Campinas, UNICAMP, Brazil
+55 (19) 987-211302

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn