FYI: https://github.com/scikit-learn/scikit-learn/pull/12364
On Sun, 28 Oct 2018 at 09:32, Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:

There is always a shuffling when iterating over the features (even when going over all features), so in the case of a tie the split will be made on the first feature encountered, which will differ because of the shuffling.

There is a PR which was intended to make the algorithm deterministic, so that it always selects the same feature in the case of a tie.

On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann <fernando.wittm...@gmail.com> wrote:

The random_state is used in the splitters:

    SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS

    splitter = self.splitter
    if not isinstance(self.splitter, Splitter):
        splitter = SPLITTERS[self.splitter](criterion,
                                            self.max_features_,
                                            min_samples_leaf,
                                            min_weight_leaf,
                                            random_state,
                                            self.presort)

which are defined as:

    DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
                       "random": _splitter.RandomSplitter}

    SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
                        "random": _splitter.RandomSparseSplitter}

Both 'best' and 'random' use random states. DecisionTreeClassifier uses 'best' as the default `splitter` parameter. I am not sure how this 'best' strategy is defined; the docs only say "Supported strategies are 'best'".

On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański <nied...@gmail.com> wrote:

Just a small side note that I've come across with Random Forests, which in the end form an ensemble of Decision Trees: I ran a thousand iterations of RFs on multi-label data and got a 4-10 percentage-point difference in subset accuracy, depending on the data set, purely as a random effect, while I've seen papers report differences of just a couple of percentage points as statistically significant after a non-parametric rank test.

On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:

Good suggestion. The trees look different: there seems to be a tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65.

So, I suspect that the features are shuffled; let's call the result X_shuffled. At some point the max_features are selected, which is by default X_shuffled[:, :n_features]. Based on that, if there's a tie between the impurities of different features, it's probably selecting whichever of the tied features comes first in the array.

If this is true (I'd have to look into the code more deeply), I wonder if it would be worthwhile to change the implementation so that the shuffling only occurs if max_features < n_features, because this way we could have deterministic behavior for the trees by default, which I'd find more intuitive for plain decision trees, tbh.

Let me know what you all think.

Best,
Sebastian

On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:

Hmmm, that's weird...

Have you tried to plot the trees (the decision rules) for the tree with different seeds, and see whether the gain for the first split is the same even if the split itself is different?

I'd at least try that before diving into the source code...

Cheers,

--
Julio
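To make Julio's suggestion concrete, here is a minimal sketch (an illustration added in editing, not from the thread); it assumes a scikit-learn version that ships sklearn.tree.export_text (0.21+). It fits the same tree under two explicit seeds and prints the learned rules; if a tie at the root is broken differently, the first printed split differs:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()

    # Fit identical models under two seeds; only random tie-breaking can differ.
    for seed in (0, 1):
        tree = DecisionTreeClassifier(random_state=seed).fit(iris.data, iris.target)
        print("random_state =", seed)
        # The rule dump makes the feature chosen for the first split easy to compare.
        print(export_text(tree, feature_names=iris.feature_names))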
On Oct 28, 2018, at 2:24 AM, Sebastian Raschka <m...@sebastianraschka.com> wrote:

Thanks, Javier,

however, max_features is n_features by default. But if you execute something like

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.3,
                                                        random_state=123,
                                                        shuffle=True,
                                                        stratify=y)

    for i in range(20):
        tree = DecisionTreeClassifier()
        tree.fit(X_train, y_train)
        print(tree.score(X_test, y_test))

you will find that the tree will produce different results if you don't fix the random seed. I suspect, related to what you said about the random feature selection when max_features is not n_features, that there is generally some sorting of the features going on, and the different trees are then due to tie-breaking when two features have the same information gain?

Best,
Sebastian

On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:

Hi Sebastian,

I think the random state is used to select the features that go into each split (look at the `max_features` parameter).

Cheers,
Javier

On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:

Hi all,

when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DecisionTreeClassifier (which is set to None by default).

I am wondering what exactly this random state is used for? I can imagine it being used for resolving ties if the information gain for multiple features is the same, or it could be that the splits of continuous features come out differently? (I thought the heuristic is to sort the feature values and to consider as split candidates those adjacent values that are associated with examples of different class labels -- but is there maybe some random subselection involved?)

If someone knows more about where the random_state is used, I'd be happy to hear it :)

Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

Best,
Sebastian
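As a closing illustration (again added in editing, not from the thread): once random_state is fixed, the nondeterminism Sebastian describes disappears, because any ties are then broken the same way on every fit. A minimal sketch:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=123, stratify=iris.target)

    # With a fixed seed every fit resolves ties identically, so all
    # 20 runs build the same tree and report the same test score.
    scores = set()
    for _ in range(20):
        tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
        scores.add(tree.score(X_test, y_test))
    print(scores)  # a single value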
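And a rough sketch of the kind of seed-sensitivity check Piotr describes, here on iris rather than multi-label data (the 4-10 percentage-point figures are from his experiments; this only shows how the spread might be measured):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=123, stratify=iris.target)

    # Refit the same forest under many seeds; the score spread is pure seed noise.
    scores = [
        RandomForestClassifier(n_estimators=100, random_state=s)
        .fit(X_train, y_train)
        .score(X_test, y_test)
        for s in range(30)
    ]
    print(min(scores), max(scores))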