There is always a shuffling when iterating over the features (even when all features are considered). So in the case of a tie, the split is made on the first feature encountered, which will differ from run to run because of the shuffling.
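A quick way to see this tie-breaking in action is to duplicate a feature so that a tie is guaranteed at the root; with different seeds, either copy can win. A minimal sketch (the toy data below is made up for illustration):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    x = rng.rand(100, 1)
    X = np.hstack([x, x])            # column 1 is an exact copy of column 0
    y = (x[:, 0] > 0.5).astype(int)  # perfectly separable on either column

    roots = set()
    for seed in range(20):
        tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
        roots.add(tree.tree_.feature[0])  # feature index used at the root
    print(roots)  # expected to contain both 0 and 1

Across seeds, both column indices typically show up, i.e. the winning feature among exact ties is decided by the shuffling order.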
There is a PR that intended to make the algorithm deterministic, so that the same feature is always selected in the case of a tie.

On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann <fernando.wittm...@gmail.com> wrote:

The random_state is used in the splitters:

    SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS

    splitter = self.splitter
    if not isinstance(self.splitter, Splitter):
        splitter = SPLITTERS[self.splitter](criterion,
                                            self.max_features_,
                                            min_samples_leaf,
                                            min_weight_leaf,
                                            random_state,
                                            self.presort)

which is defined as:

    DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
                       "random": _splitter.RandomSplitter}

    SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
                        "random": _splitter.RandomSparseSplitter}

Both 'best' and 'random' use random states. The DecisionTreeClassifier uses 'best' as the default `splitter` parameter. I am not sure how this 'best' strategy was defined; the docs only say "Supported strategies are 'best'".

On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański <nied...@gmail.com> wrote:

Just a small side note that I've come across with Random Forests, which in the end form an ensemble of Decision Trees: I ran a thousand iterations of RFs on multi-label data and got a 4-10 percentage-point difference in subset accuracy, depending on the data set, purely as a random effect, while I've seen papers report differences of just a couple of percentage points as statistically significant after a non-parametric rank test.

On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:

Good suggestion. The trees do look different; i.e., there seems to be a tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65.

So I suspect that the features are shuffled; let's call the result X_shuffled. At some point the max_features are selected, which is by default X_shuffled[:, :n_features]. Based on that, if there's a tie between the impurities of different features, it's probably selecting the first feature in the array among the ties.

If this is true (I'd have to look into the code more deeply), I wonder if it would be worthwhile to change the implementation such that the shuffling only occurs if max_features < n_features, because this way we could have deterministic behavior for the trees by default, which I'd find more intuitive for plain decision trees, tbh.

Let me know what you all think.

Best,
Sebastian

On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:

Hmmm, that's weird...

Have you tried plotting the trees (the decision rules) for trees with different seeds, to see whether the gain of the first split is the same even if the split itself is different?

I'd at least try that before diving into the source code...

Cheers,

--
Julio
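One way to act on this suggestion without plotting is to read the root split and its impurity decrease straight from the fitted tree's public arrays (tree_.feature, tree_.threshold, tree_.impurity, and the children/sample-weight arrays); a sketch on the iris data:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    for seed in range(5):
        t = DecisionTreeClassifier(random_state=seed).fit(X, y).tree_
        left, right = t.children_left[0], t.children_right[0]
        # weighted impurity decrease of the root split
        gain = t.impurity[0] - (
            t.weighted_n_node_samples[left] * t.impurity[left]
            + t.weighted_n_node_samples[right] * t.impurity[right]
        ) / t.weighted_n_node_samples[0]
        print(seed, t.feature[0], round(t.threshold[0], 3), round(gain, 4))

If the printed gain stays identical across seeds while the feature/threshold pair changes, the differences between the trees are indeed down to tie-breaking.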
On Oct 28, 2018, at 2:24 AM, Sebastian Raschka <m...@sebastianraschka.com> wrote:

Thanks, Javier,

however, max_features is n_features by default. But if you execute something like

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.3,
                                                        random_state=123,
                                                        shuffle=True,
                                                        stratify=y)

    for i in range(20):
        tree = DecisionTreeClassifier()
        tree.fit(X_train, y_train)
        print(tree.score(X_test, y_test))

you will find that the tree produces different results if you don't fix the random seed. I suspect, related to what you said about the random feature selection when max_features is not n_features, that there is generally some sorting of the features going on, and that the different trees are then due to tie-breaking when two features have the same information gain?

Best,
Sebastian

On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:

Hi Sebastian,

I think the random state is used to select the features that go into each split (look at the `max_features` parameter).

Cheers,
Javier

On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:

Hi all,

when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DecisionTreeClassifier (which is set to None by default).

I am wondering what exactly this random state is used for? I can imagine it being used for resolving ties if the information gain for multiple features is the same, or it could be that the splits of continuous features come out differently? (I thought the heuristic is to sort the feature values and to consider as split candidates those values adjacent to examples with different class labels -- but is there maybe some random subselection involved?)

If someone knows more about where exactly the random_state is used, I'd be happy to hear it :)

Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

Best,
Sebastian
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
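As a practical takeaway from the thread, fixing random_state removes the run-to-run variation observed above; a minimal sketch of the same iris loop with a fixed seed (the value 0 is arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123, stratify=y)

    scores = set()
    for i in range(20):
        # fixed seed: ties are broken the same way on every fit
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X_train, y_train)
        scores.add(tree.score(X_test, y_test))
    print(scores)  # expected: a single score, i.e. a reproducible tree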