There is always a shuffling when iterating over the features (even when
going over all features).
So in the case of a tie, the split is made on the first feature
encountered, which will differ depending on the shuffling.

There is a PR that intended to make the algorithm deterministic, so that
the same feature is always selected in the case of a tie.
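
To illustrate the tie-breaking (a minimal sketch, not taken from that PR):
if you duplicate a feature column so that two features always give the same
impurity decrease, the feature chosen for the split depends on the
shuffling, i.e. on random_state:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# duplicate the petal-length column so that two columns are perfectly tied
X_tied = np.hstack([X, X[:, [2]]])

for seed in range(5):
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tied, y)
    # the feature index used at the root may be either of the tied columns
    print(seed, tree.tree_.feature[0])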

On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann <
fernando.wittm...@gmail.com> wrote:

> The random_state is used in the splitters:
>
>         SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS
>
>         splitter = self.splitter
>         if not isinstance(self.splitter, Splitter):
>             splitter = SPLITTERS[self.splitter](criterion,
>                                                 self.max_features_,
>                                                 min_samples_leaf,
>                                                 min_weight_leaf,
>                                                 random_state,
>                                                 self.presort)
>
> Which is defined as:
>
> DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
>                    "random": _splitter.RandomSplitter}
>
> SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
>                     "random": _splitter.RandomSparseSplitter}
>
>
> Both 'best' and 'random' use the random state. The DecisionTreeClassifier
> uses 'best' as the default `splitter` parameter. I am not sure how this
> 'best' strategy is defined. The docs only say "Supported strategies are “best”.
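>
> As a quick sketch (just to illustrate, not tested exhaustively), both
> strategies accept the seed, and fixing it makes the fitted tree
> reproducible:
>
> import numpy as np
> from sklearn.datasets import load_iris
> from sklearn.tree import DecisionTreeClassifier
>
> X, y = load_iris(return_X_y=True)
> for strategy in ("best", "random"):
>     a = DecisionTreeClassifier(splitter=strategy, random_state=0).fit(X, y)
>     b = DecisionTreeClassifier(splitter=strategy, random_state=0).fit(X, y)
>     # with the seed fixed, both splitter strategies yield identical trees
>     same = (np.array_equal(a.tree_.feature, b.tree_.feature) and
>             np.array_equal(a.tree_.threshold, b.tree_.threshold))
>     print(strategy, same)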
>
>
>
>
> On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański <nied...@gmail.com> wrote:
>
>> Just a small side note on something I've come across with Random Forests,
>> which in the end are an ensemble of Decision Trees: I ran a thousand
>> iterations of RFs on multi-label data and got a 4-10 percentage point
>> difference in subset accuracy, depending on the data set, purely as a
>> random effect, while I've seen papers report differences of just a couple
>> of pp as statistically significant after a non-parametric rank test.
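>>
>> A rough sketch of the kind of experiment I mean (with a synthetic data
>> set rather than the ones I actually used, and fewer repetitions):
>>
>> from sklearn.datasets import make_multilabel_classification
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn.metrics import accuracy_score
>> from sklearn.model_selection import train_test_split
>>
>> X, y = make_multilabel_classification(n_samples=500, random_state=0)
>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
>>
>> # subset accuracy of the same RF refit with different seeds
>> scores = []
>> for s in range(20):
>>     rf = RandomForestClassifier(random_state=s).fit(X_tr, y_tr)
>>     scores.append(accuracy_score(y_te, rf.predict(X_te)))
>> print(min(scores), max(scores))  # the spread is purely a seeding effect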
>>
>> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <
>> m...@sebastianraschka.com> wrote:
>>
>>> Good suggestion. The trees look different; i.e., there seems to be a tie
>>> at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65.
>>>
>>> So, I suspect that the features are shuffled; let's call the result
>>> X_shuffled. Then at some point max_features of them are selected, which is
>>> by default X_shuffled[:, :n_features]. Based on that, if there's a tie
>>> between the impurities of different features, it probably selects the
>>> first of the tied features in the array.
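>>>
>>> A quick way to check this (an untested sketch) would be to look at the
>>> first split across seeds; if two features tie, the chosen feature index
>>> should flip between them:
>>>
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeClassifier
>>>
>>> X, y = load_iris(return_X_y=True)
>>> X_train, _, y_train, _ = train_test_split(X, y, test_size=0.3,
>>>                                           random_state=123, stratify=y)
>>>
>>> for seed in range(10):
>>>     tree = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
>>>     # feature index and threshold of the first (root) split
>>>     print(seed, tree.tree_.feature[0], tree.tree_.threshold[0])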
>>>
>>> If this is true (I'd have to look into the code more deeply then) I wonder
>>> if it would be worthwhile to change the implementation such that the
>>> shuffling only occurs if max_features < n_features, because this way we
>>> could have deterministic behavior for the trees by default, which I'd find
>>> more intuitive for plain decision trees tbh.
>>>
>>> Let me know what you all think.
>>>
>>> Best,
>>> Sebastian
>>>
>>> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente <
>>> ju...@esbet.es> wrote:
>>> >
>>> > Hmmm that’s weird...
>>> >
>>> > Have you tried to plot the trees (the decision rules) for different
>>> > seeds, and see whether the gain for the first split is the same even if
>>> > the split itself is different?
>>> >
>>> > I’d at least try that before diving into the source code...
>>> >
>>> > Cheers,
>>> >
>>> > --
>>> > Julio
>>> >
>>> >> On 28 Oct 2018, at 2:24, Sebastian Raschka <
>>> m...@sebastianraschka.com> wrote:
>>> >>
>>> >> Thanks, Javier,
>>> >>
>>> >> however, max_features is n_features by default. But if you execute
>>> >> something like
>>> >>
>>> >> import numpy as np
>>> >> from sklearn.datasets import load_iris
>>> >> from sklearn.model_selection import train_test_split
>>> >> from sklearn.tree import DecisionTreeClassifier
>>> >>
>>> >> iris = load_iris()
>>> >> X, y = iris.data, iris.target
>>> >> X_train, X_test, y_train, y_test = train_test_split(X, y,
>>> >>                                                   test_size=0.3,
>>> >>                                                   random_state=123,
>>> >>                                                   shuffle=True,
>>> >>                                                   stratify=y)
>>> >>
>>> >> for i in range(20):
>>> >>   tree = DecisionTreeClassifier()
>>> >>   tree.fit(X_train, y_train)
>>> >>   print(tree.score(X_test, y_test))
>>> >>
>>> >>
>>> >>
>>> >> You will find that the tree produces different results if you don't
>>> >> fix the random seed. Related to what you said about the random feature
>>> >> selection when max_features is not n_features, I suspect that there is
>>> >> generally some sorting of the features going on, and that the different
>>> >> trees are then due to tie-breaking when two features have the same
>>> >> information gain?
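>>> >>
>>> >> Continuing the snippet above, setting a fixed random_state in the loop
>>> >> should make every iteration print the same score:
>>> >>
>>> >> for i in range(20):
>>> >>     tree = DecisionTreeClassifier(random_state=0)
>>> >>     tree.fit(X_train, y_train)
>>> >>     print(tree.score(X_test, y_test))  # identical across iterations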
>>> >>
>>> >> Best,
>>> >> Sebastian
>>> >>
>>> >>
>>> >>
>>> >>> On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:
>>> >>>
>>> >>> Hi Sebastian,
>>> >>>
>>> >>> I think the random state is used to select the features that go into
>>> each split (look at the `max_features` parameter)
>>> >>>
>>> >>> Cheers,
>>> >>> Javier
>>> >>>
>>> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <
>>> m...@sebastianraschka.com> wrote:
>>> >>> Hi all,
>>> >>>
>>> >>> when I was implementing a bagging classifier based on scikit-learn's
>>> >>> DecisionTreeClassifier, I noticed that the results were not deterministic
>>> >>> and found that this was due to the random_state in the
>>> >>> DecisionTreeClassifier (which is set to None by default).
>>> >>>
>>> >>> I am wondering what exactly this random state is used for? I can
>>> >>> imagine it being used for resolving ties if the information gain for
>>> >>> multiple features is the same, or it could be that the split points of
>>> >>> continuous features are different? (I thought the heuristic is to sort
>>> >>> the feature values and to consider those values adjacent to each other
>>> >>> that are associated with examples having different class labels -- but
>>> >>> is there maybe some random subselection involved?)
>>> >>>
>>> >>> If someone knows more about this, where the random_state is used,
>>> I'd be happy to hear it :)
>>> >>>
>>> >>> Also, we could then maybe add the info to the
>>> DecisionTreeClassifier's docstring, which is currently a bit too generic to
>>> be useful, I think:
>>> >>>
>>> >>>
>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>>> >>>
>>> >>>
>>> >>>   random_state : int, RandomState instance or None, optional (default=None)
>>> >>>       If int, random_state is the seed used by the random number generator;
>>> >>>       If RandomState instance, random_state is the random number generator;
>>> >>>       If None, the random number generator is the RandomState instance
>>> >>>       used by `np.random`.
>>> >>>
>>> >>>
>>> >>> Best,
>>> >>> Sebastian
>>> >>
>>>
>>>
>>
>>
>> --
>> Piotr Szymański
>> nied...@gmail.com
>>
>
>
> --
>
> Fernando Marcos Wittmann
> MS Student - Energy Systems Dept.
> School of Electrical and Computer Engineering, FEEC
> University of Campinas, UNICAMP, Brazil
> +55 (19) 987-211302
>
>


-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
