Just a small side note on something I've come across with Random Forests,
which in the end are ensembles of Decision Trees. I ran a thousand
repetitions of RFs on multi-label data and, purely as a random effect of
the seed, got a 4-10 percentage point spread in subset accuracy depending
on the data set, while I've seen papers report differences of just a
couple of pp as statistically significant after a non-parametric rank test.
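
For illustration, here is a minimal sketch of that kind of experiment
(synthetic multi-label data via make_multilabel_classification as a
stand-in for the real benchmark sets, and 50 repetitions rather than a
thousand to keep it quick):

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real multi-label benchmark data set.
X, Y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=5, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=0)

scores = []
for seed in range(50):  # the original experiment used 1000 repetitions
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_train, Y_train)
    # accuracy_score on multi-label indicator targets is subset accuracy.
    scores.append(accuracy_score(Y_test, rf.predict(X_test)))

print(f"subset accuracy spread across seeds: "
      f"{100 * (max(scores) - min(scores)):.1f} pp")

The min-max spread across seeds is exactly the random effect I mean.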

On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:

> Good suggestion. The trees do look different; i.e., there seems to be a
> tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65.
>
> So, I suspect that the features are shuffled; let's call the result
> X_shuffled. Then at some point the max_features are selected, which is by
> default X_shuffled[:, :n_features]. Based on that, if there's a tie
> between the impurities of different features, it's probably selecting the
> first feature in the array among those ties.
>
> If this is true (I'd have to look into the code more deeply), I wonder if
> it would be worthwhile to change the implementation such that the
> shuffling only occurs if max_features < n_features, because this way we
> could have deterministic behavior for the trees by default, which I'd
> find more intuitive for plain decision trees, tbh.
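>
> A quick way to probe the tie-breaking part without diving into the Cython
> right away, as a sketch on the full iris data (where, if I remember
> correctly, petal length and petal width both separate setosa perfectly,
> so their impurity decreases tie at the root):
>
> from sklearn.datasets import load_iris
> from sklearn.tree import DecisionTreeClassifier
>
> X, y = load_iris(return_X_y=True)
> for seed in range(5):
>     tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
>     # If ties are broken by array position after shuffling, the chosen
>     # root feature should flip between the tied features across seeds.
>     print(f"seed {seed}: root split on X[:, {tree.tree_.feature[0]}]"
>           f" <= {tree.tree_.threshold[0]:.2f}")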
>
> Let me know what you all think.
>
> Best,
> Sebastian
>
> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:
> >
> > Hmmm that’s weird...
> >
> > Have you tried plotting the trees (the decision rules) for different
> > seeds, to see if the gain for the first split is the same even if the
> > split itself is different?
> >
> > I’d at least try that before diving into the source code...
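> >
> > Something along these lines would get at the gain numbers directly, as
> > a sketch that reads the fitted tree_ internals instead of an actual plot:
> >
> > from sklearn.datasets import load_iris
> > from sklearn.tree import DecisionTreeClassifier
> >
> > X, y = load_iris(return_X_y=True)
> > for seed in (0, 1):
> >     t = DecisionTreeClassifier(random_state=seed).fit(X, y).tree_
> >     n = t.n_node_samples
> >     left, right = t.children_left[0], t.children_right[0]
> >     # Weighted impurity decrease of the first split.
> >     gain = t.impurity[0] - (n[left] * t.impurity[left]
> >                             + n[right] * t.impurity[right]) / n[0]
> >     print(f"seed {seed}: X[:, {t.feature[0]}] <= {t.threshold[0]:.2f}, "
> >           f"gain {gain:.4f}")
> >
> > If the split flips but the gain stays the same, it's a tie.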
> >
> > Cheers,
> >
> > --
> > Julio
> >
> >> On Oct 28, 2018, at 2:24, Sebastian Raschka <m...@sebastianraschka.com> wrote:
> >>
> >> Thanks, Javier,
> >>
> >> however, max_features is n_features by default. But if you execute
> >> something like
> >>
> >> from sklearn.datasets import load_iris
> >> from sklearn.model_selection import train_test_split
> >> from sklearn.tree import DecisionTreeClassifier
> >>
> >> iris = load_iris()
> >> X, y = iris.data, iris.target
> >> X_train, X_test, y_train, y_test = train_test_split(X, y,
> >>                                                     test_size=0.3,
> >>                                                     random_state=123,
> >>                                                     shuffle=True,
> >>                                                     stratify=y)
> >>
> >> # No random_state is set here, so each fit may yield a different tree:
> >> for i in range(20):
> >>     tree = DecisionTreeClassifier()
> >>     tree.fit(X_train, y_train)
> >>     print(tree.score(X_test, y_test))
> >>
> >>
> >>
> >> you will find that the tree will produce different results if you don't
> >> fix the random seed. I suspect, related to what you said about random
> >> feature selection when max_features is less than n_features, that there
> >> is generally some sorting of the features going on, and that the
> >> different trees are then due to tie-breaking when two features have the
> >> same information gain?
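> >>
> >> One way to check the tie part, as a sketch reusing the imports and
> >> X_train/y_train from the snippet above: fit a depth-1 stump on each
> >> feature in isolation and compare the best impurity decrease (gini by
> >> default) each feature can achieve:
> >>
> >> for j in range(X_train.shape[1]):
> >>     stump = DecisionTreeClassifier(max_depth=1)
> >>     stump.fit(X_train[:, [j]], y_train)
> >>     t = stump.tree_
> >>     n = t.n_node_samples
> >>     # Weighted impurity decrease of the stump's root split
> >>     # (node 0 is the root; nodes 1 and 2 are its children).
> >>     gain = t.impurity[0] - (n[1] * t.impurity[1]
> >>                             + n[2] * t.impurity[2]) / n[0]
> >>     print(f"feature {j}: threshold {t.threshold[0]:.2f}, gain {gain:.4f}")
> >>
> >> If two features print the same best gain, tie-breaking would explain
> >> the differing trees.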
> >>
> >> Best,
> >> Sebastian
> >>
> >>
> >>
> >>> On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:
> >>>
> >>> Hi Sebastian,
> >>>
> >>> I think the random state is used to select the features that go into
> >>> each split (look at the `max_features` parameter).
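> >>>
> >>> For example (just a sketch): with max_features=2, each split only
> >>> considers a random 2-of-4 feature subset, so the fitted tree visibly
> >>> depends on random_state:
> >>>
> >>> from sklearn.datasets import load_iris
> >>> from sklearn.tree import DecisionTreeClassifier
> >>>
> >>> X, y = load_iris(return_X_y=True)
> >>> for seed in (0, 1, 2):
> >>>     tree = DecisionTreeClassifier(max_features=2, random_state=seed)
> >>>     tree.fit(X, y)
> >>>     # The root feature can change because each split samples features.
> >>>     print(f"seed {seed}: root split on feature {tree.tree_.feature[0]}")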
> >>>
> >>> Cheers,
> >>> Javier
> >>>
> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:
> >>> Hi all,
> >>>
> >>> when I was implementing a bagging classifier based on scikit-learn's
> >>> DecisionTreeClassifier, I noticed that the results were not
> >>> deterministic and found that this was due to the random_state in the
> >>> DecisionTreeClassifier (which is set to None by default).
> >>>
> >>> I am wondering what exactly this random state is used for. I can
> >>> imagine it being used to resolve ties if the information gain for
> >>> multiple features is the same, or it could be that the split points
> >>> for continuous features come out differently? (I thought the heuristic
> >>> is to sort the values of each feature and to consider as candidate
> >>> thresholds those adjacent values associated with examples that have
> >>> different class labels -- but is there maybe some random subselection
> >>> involved?)
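> >>>
> >>> (To make the heuristic I mean concrete, a sketch; candidate_thresholds
> >>> is just an illustrative helper, not scikit-learn's actual
> >>> implementation:)
> >>>
> >>> import numpy as np
> >>>
> >>> def candidate_thresholds(x, y):
> >>>     # Sort a single feature column; candidate split points are the
> >>>     # midpoints between adjacent distinct values whose labels differ.
> >>>     order = np.argsort(x)
> >>>     xs, ys = x[order], y[order]
> >>>     return [(xs[i] + xs[i + 1]) / 2
> >>>             for i in range(len(xs) - 1)
> >>>             if xs[i] != xs[i + 1] and ys[i] != ys[i + 1]]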
> >>>
> >>> If someone knows more about where exactly the random_state is used,
> >>> I'd be happy to hear it :)
> >>>
> >>> Also, we could then maybe add the info to the DecisionTreeClassifier's
> >>> docstring, which is currently a bit too generic to be useful, I think:
> >>>
> >>>
> >>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
> >>>
> >>>
> >>>   random_state : int, RandomState instance or None, optional (default=None)
> >>>       If int, random_state is the seed used by the random number generator;
> >>>       If RandomState instance, random_state is the random number generator;
> >>>       If None, the random number generator is the RandomState instance
> >>>       used by `np.random`.
> >>>
> >>>
> >>> Best,
> >>> Sebastian
> >>
>


-- 
Piotr Szymański
nied...@gmail.com
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
