The random_state is used in the splitters. In sklearn/tree/tree.py, the splitter is instantiated as:

    SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS

    splitter = self.splitter
    if not isinstance(self.splitter, Splitter):
        splitter = SPLITTERS[self.splitter](criterion,
                                            self.max_features_,
                                            min_samples_leaf,
                                            min_weight_leaf,
                                            random_state,
                                            self.presort)

where the splitter classes are defined as:

    DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
                       "random": _splitter.RandomSplitter}

    SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
                        "random": _splitter.RandomSparseSplitter}

Both the 'best' and 'random' splitters use the random state. DecisionTreeClassifier uses 'best' as the default `splitter` parameter. I am not sure how exactly this 'best' strategy is defined; the docs only say "Supported strategies are 'best' to choose the best split and 'random' to choose the best random split."
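One quick way to see that the seed flows through the splitter is a minimal sketch using only the public tree_ arrays of a fitted tree (this is an illustration on iris, nothing more):

    # Fixing random_state makes repeated fits reproducible, which is
    # consistent with the splitter excerpt above consuming the seed.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Same seed twice: identical split features and thresholds.
    t1 = DecisionTreeClassifier(random_state=0).fit(X, y)
    t2 = DecisionTreeClassifier(random_state=0).fit(X, y)
    assert np.array_equal(t1.tree_.feature, t2.tree_.feature)
    assert np.allclose(t1.tree_.threshold, t2.tree_.threshold)

    # A different seed may yield a different tree, even though
    # max_features defaults to n_features here.
    t3 = DecisionTreeClassifier(random_state=1).fit(X, y)
    print(np.array_equal(t1.tree_.feature, t3.tree_.feature))  # may be False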
On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański <nied...@gmail.com> wrote:

> Just a small side note that I've come across with Random Forests, which
> in the end form an ensemble of Decision Trees. I ran a thousand
> iterations of RFs on multi-label data and managed to get a 4-10
> percentage point difference in subset accuracy, depending on the data
> set, just as a random effect, while I've seen papers report differences
> of just a couple pp as statistically significant after a non-parametric
> rank test.
>
> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>
>> Good suggestion. The trees look different; i.e., there seems to be a
>> tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65.
>>
>> So, I suspect that the features are shuffled; let's call the result
>> X_shuffled. Then at some point max_features of them are selected, which
>> by default is X_shuffled[:, :n_features]. Based on that, if there's a
>> tie between the impurities of different features, it's probably
>> selecting whichever of the tied features comes first in the array.
>>
>> If this is true (I'd have to look into the code more deeply), I wonder
>> if it would be worthwhile to change the implementation such that the
>> shuffling only occurs if max_features < n_features, because this way we
>> could have deterministic behavior for the trees by default, which I'd
>> find more intuitive for plain decision trees, to be honest.
>>
>> Let me know what you all think.
>>
>> Best,
>> Sebastian
>>
>>> On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:
>>>
>>> Hmmm, that's weird...
>>>
>>> Have you tried to plot the trees (the decision rules) for trees with
>>> different seeds, and checked whether the gain for the first split is
>>> the same even if the split itself is different?
>>>
>>> I'd at least try that before diving into the source code...
>>>
>>> Cheers,
>>>
>>> --
>>> Julio
>>>
>>>> On Oct 28, 2018, at 2:24, Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>
>>>> Thanks, Javier,
>>>>
>>>> however, max_features is n_features by default. But if you execute
>>>> something like
>>>>
>>>> import numpy as np
>>>> from sklearn.datasets import load_iris
>>>> from sklearn.model_selection import train_test_split
>>>> from sklearn.tree import DecisionTreeClassifier
>>>>
>>>> iris = load_iris()
>>>> X, y = iris.data, iris.target
>>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
>>>>                                                     test_size=0.3,
>>>>                                                     random_state=123,
>>>>                                                     shuffle=True,
>>>>                                                     stratify=y)
>>>>
>>>> for i in range(20):
>>>>     tree = DecisionTreeClassifier()
>>>>     tree.fit(X_train, y_train)
>>>>     print(tree.score(X_test, y_test))
>>>>
>>>> you will find that the tree will produce different results if you
>>>> don't fix the random seed.
>>>>
>>>> I suspect, related to what you said about the random feature
>>>> selection if max_features is not n_features, that there is generally
>>>> some sorting of the features going on, and that the different trees
>>>> are then due to tie-breaking if two features have the same
>>>> information gain?
>>>>
>>>> Best,
>>>> Sebastian
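Sebastian's tie-breaking suspicion above can be probed directly: duplicate one column so that two features yield exactly the same impurity reduction by construction, then watch which one the root split lands on across seeds (the duplicated-column trick is just an illustration for this thread, not how scikit-learn itself tests anything):

    # With an exact duplicate column, the candidate split gains tie by
    # construction, so any seed-dependence of the chosen feature exposes
    # the tie-breaking behavior.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_dup = np.hstack([X, X[:, [2]]])  # column 4 is an exact copy of column 2

    for seed in range(10):
        tree = DecisionTreeClassifier(random_state=seed).fit(X_dup, y)
        print(seed, "-> root splits on feature", tree.tree_.feature[0])

If the root feature varies between column 2 and its copy (column 4) as the seed changes, ties are indeed broken by the random order in which the splitter visits the features.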
>>>>
>>>>> On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:
>>>>>
>>>>> Hi Sebastian,
>>>>>
>>>>> I think the random state is used to select the features that go into
>>>>> each split (look at the `max_features` parameter).
>>>>>
>>>>> Cheers,
>>>>> Javier
>>>>>
>>>>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> when I was implementing a bagging classifier based on scikit-learn's
>>>>> DecisionTreeClassifier, I noticed that the results were not
>>>>> deterministic and found that this was due to the random_state in the
>>>>> DecisionTreeClassifier (which is set to None by default).
>>>>>
>>>>> I am wondering what exactly this random state is used for? I can
>>>>> imagine it being used for resolving ties if the information gain for
>>>>> multiple features is the same, or it could be that the split points
>>>>> of continuous features come out differently? (I thought the heuristic
>>>>> is to sort the feature values and to consider candidate splits
>>>>> between adjacent values that are associated with examples of
>>>>> different class labels -- but is there maybe some random subselection
>>>>> involved?)
>>>>>
>>>>> If someone knows more about where the random_state is used, I'd be
>>>>> happy to hear it :)
>>>>>
>>>>> Also, we could then maybe add the info to the DecisionTreeClassifier's
>>>>> docstring, which is currently a bit too generic to be useful, I think:
>>>>>
>>>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>>>>>
>>>>>     random_state : int, RandomState instance or None, optional (default=None)
>>>>>         If int, random_state is the seed used by the random number generator;
>>>>>         If RandomState instance, random_state is the random number generator;
>>>>>         If None, the random number generator is the RandomState instance used
>>>>>         by `np.random`.
>>>>>
>>>>> Best,
>>>>> Sebastian
>
> --
> Piotr Szymański
> nied...@gmail.com
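As a side note on Piotr's observation, the size of the seed-only effect is easy to quantify for a given dataset. Here is a minimal sketch on iris (his numbers were for subset accuracy on multi-label data, so the spread here will differ):

    # Refit the same forest under different seeds and report the spread
    # of test accuracy attributable to the seed alone.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123, stratify=y)

    scores = []
    for seed in range(30):
        rf = RandomForestClassifier(n_estimators=100, random_state=seed)
        scores.append(rf.fit(X_train, y_train).score(X_test, y_test))

    print("min=%.3f max=%.3f spread=%.3f"
          % (min(scores), max(scores), max(scores) - min(scores)))

Seed variance of this kind is worth reporting alongside any claimed accuracy difference between methods.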
--
Fernando Marcos Wittmann
MS Student - Energy Systems Dept.
School of Electrical and Computer Engineering, FEEC
University of Campinas, UNICAMP, Brazil
+55 (19) 987-211302

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn