That's nice to know, thanks a lot for the reference!

Best,
Sebastian

> On Oct 28, 2018, at 3:34 AM, Guillaume Lemaître <g.lemaitr...@gmail.com> 
> wrote:
> 
> FYI: https://github.com/scikit-learn/scikit-learn/pull/12364
> 
> On Sun, 28 Oct 2018 at 09:32, Guillaume Lemaître <g.lemaitr...@gmail.com> 
> wrote:
> There is always a shuffling when iterating over the features (even when going 
> over all of the features).
> So in the case of a tie, the split will be made on the first feature encountered, 
> which can differ from run to run because of the shuffling.
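> 
> Something like this (a quick, untested sketch; the two duplicated columns are 
> only there to force an exact tie) should show the effect:
> 
> import numpy as np
> from sklearn.tree import DecisionTreeClassifier
> 
> # Two identical columns give an exact tie in impurity improvement, so which
> # one wins depends only on the order in which the features are visited.
> rng = np.random.RandomState(0)
> col = rng.rand(200, 1)
> X = np.hstack([col, col])
> y = (col.ravel() > 0.5).astype(int)
> 
> roots = {DecisionTreeClassifier(random_state=s).fit(X, y).tree_.feature[0]
>          for s in range(20)}
> print(roots)  # typically {0, 1}: either tied column can end up as the root split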
> 
> There is a PR that intends to make the algorithm deterministic, so that it 
> always selects the same feature in the case of a tie.
> 
> On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann 
> <fernando.wittm...@gmail.com> wrote:
> The random_state is used in the splitters:
> 
>         SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS
> 
>         splitter = self.splitter
>         if not isinstance(self.splitter, Splitter):
>             splitter = SPLITTERS[self.splitter](criterion,
>                                                 self.max_features_,
>                                                 min_samples_leaf,
>                                                 min_weight_leaf,
>                                                 random_state,
>                                                 self.presort)
> 
> Which is defined as:
> 
> DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
>                    "random": _splitter.RandomSplitter}
> 
> SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
>                     "random": _splitter.RandomSparseSplitter}
> 
> Both 'best' and 'random' use a random state. DecisionTreeClassifier uses 
> 'best' as the default `splitter` parameter. I am not sure how this 'best' 
> strategy is defined; the docs only say that the supported strategies are 
> "best" to choose the best split and "random" to choose the best random split. 
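> 
> As a quick sanity check (just a sketch), fixing the seed should then give 
> identical trees even with the default 'best' splitter:
> 
> from sklearn.datasets import load_iris
> from sklearn.tree import DecisionTreeClassifier
> 
> X, y = load_iris(return_X_y=True)
> 
> # Same seed -> same feature shuffling inside the splitter -> identical trees.
> a = DecisionTreeClassifier(splitter="best", random_state=0).fit(X, y)
> b = DecisionTreeClassifier(splitter="best", random_state=0).fit(X, y)
> print((a.tree_.feature == b.tree_.feature).all(),
>       (a.tree_.threshold == b.tree_.threshold).all())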
> 
> 
> 
> 
> On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański <nied...@gmail.com> wrote:
> Just a small side note on something I've come across with Random Forests, which 
> in the end form an ensemble of Decision Trees. I ran a thousand iterations of 
> RFs on multi-label data and managed to get a 4-10 percentage point difference 
> in subset accuracy, depending on the data set, purely as a random effect, while 
> I've seen papers report differences of just a couple of percentage points as 
> statistically significant after a non-parametric rank test. 
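> 
> Roughly the kind of experiment I mean, sketched on synthetic multi-label data 
> rather than the actual benchmark sets (accuracy_score on label-indicator 
> matrices is exactly subset accuracy):
> 
> from sklearn.datasets import make_multilabel_classification
> from sklearn.ensemble import RandomForestClassifier
> from sklearn.metrics import accuracy_score
> from sklearn.model_selection import train_test_split
> 
> X, y = make_multilabel_classification(n_samples=600, random_state=0)
> X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
> 
> # Only the forest's seed changes between runs; the spread in subset accuracy
> # is therefore purely a random effect.
> scores = []
> for seed in range(30):
>     rf = RandomForestClassifier(n_estimators=100, random_state=seed)
>     scores.append(accuracy_score(y_te, rf.fit(X_tr, y_tr).predict(X_te)))
> print(min(scores), max(scores))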
> 
> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <m...@sebastianraschka.com> 
> wrote:
> Good suggestion. The trees look different. I.e., there seems to be a tie at 
> some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65
> 
> So, I suspect that the features are shuffled, giving, say, X_shuffled. Then 
> at some point max_features of them are selected, which is by default 
> X_shuffled[:, :n_features]. Based on that, if there's a tie between the 
> impurities for different features, the splitter is probably selecting the 
> first feature in the array among these ties.
> 
> If this is true (I have to look into the code more deeply then), I wonder if it 
> would be worthwhile to change the implementation such that the shuffling only 
> occurs if max_features < n_features, because this way we could have 
> deterministic behavior for the trees by default, which I'd find more 
> intuitive for plain decision trees, tbh.
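> 
> A quick way to probe this (untested sketch) would be to record the root split 
> for a few seeds; if there is a tie among equally good features, the chosen 
> feature may change with the seed while the split quality does not:
> 
> from sklearn.datasets import load_iris
> from sklearn.tree import DecisionTreeClassifier
> 
> X, y = load_iris(return_X_y=True)
> 
> # tree_.feature[0] / tree_.threshold[0] are the feature index and threshold
> # used at the root node.
> for seed in range(5):
>     t = DecisionTreeClassifier(random_state=seed).fit(X, y)
>     print(seed, t.tree_.feature[0], t.tree_.threshold[0])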
> 
> Let me know what you all think.
> 
> Best,
> Sebastian
> 
> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente 
> > <ju...@esbet.es> wrote:
> > 
> > Hmmm that’s weird...
> > 
> > Have you tried to plot the trees (the decision rules) for the tree with 
> > different seeds, and see if the gain for the first split is the same even 
> > if the split itself is different?
> > 
> > I’d at least try that before diving into the source code...
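> > 
> > Something along these lines (just a sketch) would dump the rules and the 
> > impurities around the first split for a couple of seeds:
> > 
> > from sklearn.datasets import load_iris
> > from sklearn.tree import DecisionTreeClassifier, export_graphviz
> > 
> > X, y = load_iris(return_X_y=True)
> > 
> > # export_graphviz(..., out_file=None) returns the dot source, which includes
> > # the impurity of each node; max_depth=1 keeps only the root and its children.
> > for seed in (0, 1):
> >     t = DecisionTreeClassifier(random_state=seed).fit(X, y)
> >     print(export_graphviz(t, out_file=None, max_depth=1))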
> > 
> > Cheers,
> > 
> > --
> > Julio
> > 
> >> On 28 Oct 2018, at 2:24, Sebastian Raschka <m...@sebastianraschka.com> 
> >> wrote:
> >> 
> >> Thanks, Javier,
> >> 
> >> however, max_features is n_features by default. But if you execute 
> >> something like
> >> 
> >> import numpy as np
> >> from sklearn.datasets import load_iris
> >> from sklearn.model_selection import train_test_split
> >> from sklearn.tree import DecisionTreeClassifier
> >> 
> >> iris = load_iris()
> >> X, y = iris.data, iris.target
> >> X_train, X_test, y_train, y_test = train_test_split(X, y,
> >>                                                   test_size=0.3,
> >>                                                   random_state=123,
> >>                                                   shuffle=True,
> >>                                                   stratify=y)
> >> 
> >> for i in range(20):
> >>     tree = DecisionTreeClassifier()
> >>     tree.fit(X_train, y_train)
> >>     print(tree.score(X_test, y_test))
> >> 
> >> 
> >> 
> >> You will find that the tree produces different results if you don't 
> >> fix the random seed. Related to what you said about the random feature 
> >> selection when max_features is not n_features, I suspect that there is 
> >> generally some sorting of the features going on, and that the different 
> >> trees are then due to tie-breaking when two features have the same 
> >> information gain?
> >> 
> >> Best,
> >> Sebastian
> >> 
> >> 
> >> 
> >>> On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:
> >>> 
> >>> Hi Sebastian,
> >>> 
> >>> I think the random state is used to select the features that go into each 
> >>> split (look at the `max_features` parameter)
> >>> 
> >>> Cheers,
> >>> Javier
> >>> 
> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka 
> >>> <m...@sebastianraschka.com> wrote:
> >>> Hi all,
> >>> 
> >>> when I was implementing a bagging classifier based on scikit-learn's 
> >>> DecisionTreeClassifier, I noticed that the results were not deterministic 
> >>> and found that this was due to the random_state in the 
> >>> DecisionTreeClassifier (which is set to None by default).
> >>> 
> >>> I am wondering what exactly this random state is used for? I can imagine 
> >>> it being used for resolving ties if the information gain for multiple 
> >>> features is the same, or it could be that the split points of continuous 
> >>> features turn out different? (I thought the heuristic is to sort the 
> >>> feature values and to consider as candidate thresholds those values 
> >>> adjacent to examples that have different class labels -- but is there 
> >>> maybe some random subselection involved?)
> >>> 
> >>> If someone knows more about this, where the random_state is used, I'd be 
> >>> happy to hear it :)
> >>> 
> >>> Also, we could then maybe add the info to the DecisionTreeClassifier's 
> >>> docstring, which is currently a bit too generic to be useful, I think:
> >>> 
> >>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
> >>> 
> >>> 
> >>>   random_state : int, RandomState instance or None, optional (default=None)
> >>>       If int, random_state is the seed used by the random number generator;
> >>>       If RandomState instance, random_state is the random number generator;
> >>>       If None, the random number generator is the RandomState instance used
> >>>       by `np.random`.
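> >>> 
> >>> For instance, once it's clear where the randomness enters, something along 
> >>> these lines (just a rough sketch, to be checked against the splitter code):
> >>> 
> >>>   random_state : int, RandomState instance or None, optional (default=None)
> >>>       Controls the randomness of the estimator, e.g., the order in which
> >>>       the features are considered at each split (and hence how ties between
> >>>       equally good splits are broken). Pass an int for reproducible results
> >>>       across multiple fits.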
> >>> 
> >>> 
> >>> Best,
> >>> Sebastian
> >> 
> 
> 
> 
> -- 
> Piotr Szymański
> nied...@gmail.com
> 
> 
> -- 
> 
> Fernando Marcos Wittmann
> MS Student - Energy Systems Dept. 
> School of Electrical and Computer Engineering, FEEC
> University of Campinas, UNICAMP, Brazil
> +55 (19) 987-211302
> 
> 
> 
> -- 
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
> 
> 

