Just a small side note on something I've come across with Random Forests, which in the end are an ensemble of Decision Trees. I ran a thousand iterations of RFs on multi-label data and saw a 4-10 percentage point difference in subset accuracy, depending on the data set, purely as a random effect, while I've seen papers report differences of just a couple of percentage points as statistically significant after a non-parametric rank test.
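For reference, a minimal sketch of the kind of experiment described above, using a synthetic multi-label data set (make_multilabel_classification) as a stand-in for the real data; for multi-label targets, scikit-learn's accuracy_score is exactly subset accuracy:

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic multi-label data as a stand-in for the real data sets.
X, Y = make_multilabel_classification(n_samples=2000, n_features=20,
                                      n_classes=5, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3,
                                                    random_state=0)

scores = []
for seed in range(100):  # 1000 iterations in the experiment above
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_train, Y_train)
    # accuracy_score on multi-label targets is subset accuracy (exact match).
    scores.append(accuracy_score(Y_test, rf.predict(X_test)))

print("min=%.3f max=%.3f spread=%.1f pp"
      % (min(scores), max(scores), (max(scores) - min(scores)) * 100))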
On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:

> Good suggestion. The trees look different, i.e., there seems to be a tie
> at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65.
>
> So, I suspect that the features are shuffled; let's call the result
> X_shuffled. Then at some point the max_features are selected, which is by
> default X_shuffled[:, :n_features]. Based on that, if there's a tie between
> impurities for the different features, it's probably selecting the first
> feature in the array among these ties.
>
> If this is true (I'd have to look into the code more deeply), I wonder if
> it would be worthwhile to change the implementation such that the shuffling
> only occurs if max_features < n_features, because this way we could have
> deterministic behavior for the trees by default, which I'd find more
> intuitive for plain decision trees, tbh.
>
> Let me know what you all think.
>
> Best,
> Sebastian
>
>
> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:
> >
> > Hmmm, that's weird...
> >
> > Have you tried to plot the trees (the decision rules) for the trees with
> > different seeds, and see if the gain for the first split is the same even
> > if the split itself is different?
> >
> > I'd at least try that before diving into the source code...
> >
> > Cheers,
> >
> > --
> > Julio
> >
> >> On Oct 28, 2018, at 2:24, Sebastian Raschka <m...@sebastianraschka.com> wrote:
> >>
> >> Thanks, Javier,
> >>
> >> however, max_features is n_features by default. But if you execute
> >> something like
> >>
> >> import numpy as np
> >> from sklearn.datasets import load_iris
> >> from sklearn.model_selection import train_test_split
> >> from sklearn.tree import DecisionTreeClassifier
> >>
> >> iris = load_iris()
> >> X, y = iris.data, iris.target
> >> X_train, X_test, y_train, y_test = train_test_split(X, y,
> >>                                                     test_size=0.3,
> >>                                                     random_state=123,
> >>                                                     shuffle=True,
> >>                                                     stratify=y)
> >>
> >> for i in range(20):
> >>     tree = DecisionTreeClassifier()
> >>     tree.fit(X_train, y_train)
> >>     print(tree.score(X_test, y_test))
> >>
> >> you will find that the tree will produce different results if you don't
> >> fix the random seed. I suspect, related to what you said about the random
> >> feature selection if max_features is not n_features, that there is
> >> generally some sorting of the features going on, and the different trees
> >> are then due to tie-breaking if two features have the same information
> >> gain?
> >>
> >> Best,
> >> Sebastian
> >>
> >>
> >>> On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:
> >>>
> >>> Hi Sebastian,
> >>>
> >>> I think the random state is used to select the features that go into
> >>> each split (look at the `max_features` parameter).
> >>>
> >>> Cheers,
> >>> Javier
> >>>
> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <m...@sebastianraschka.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> when I was implementing a bagging classifier based on scikit-learn's
> >>> DecisionTreeClassifier, I noticed that the results were not deterministic
> >>> and found that this was due to the random_state in the
> >>> DecisionTreeClassifier (which is set to None by default).
> >>>
> >>> I am wondering what exactly this random state is used for? I can
> >>> imagine it being used for resolving ties if the information gain for
> >>> multiple features is the same, or it could be that the feature splits of
> >>> continuous features are different?
> >>> (I thought the heuristic was to sort the values of each feature and to
> >>> consider as candidate split points those adjacent values that belong to
> >>> examples with different class labels -- but is there maybe some random
> >>> subselection involved?)
> >>>
> >>> If someone knows more about this, i.e., where the random_state is used,
> >>> I'd be happy to hear it :)
> >>>
> >>> Also, we could then maybe add the info to the DecisionTreeClassifier's
> >>> docstring, which is currently a bit too generic to be useful, I think:
> >>>
> >>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
> >>>
> >>> random_state : int, RandomState instance or None, optional (default=None)
> >>>     If int, random_state is the seed used by the random number generator;
> >>>     If RandomState instance, random_state is the random number generator;
> >>>     If None, the random number generator is the RandomState instance used
> >>>     by `np.random`.
> >>>
> >>> Best,
> >>> Sebastian

--
Piotr Szymański
nied...@gmail.com
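A quick way to check Julio's suggestion above without diving into the source is to inspect the root node of trees fit with different seeds through the low-level tree_ arrays. This is only a sketch (same iris split as in the code above): it tallies which root split each seed picks together with the impurity decrease of that split -- if the decreases agree while the feature/threshold differs, the variation is pure tie-breaking:

from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=123,
    shuffle=True, stratify=iris.target)

root_splits = Counter()
for seed in range(50):
    t = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train).tree_
    left, right = t.children_left[0], t.children_right[0]
    n = t.weighted_n_node_samples
    # Impurity decrease of the root split; tied splits share this value
    # even when the chosen feature/threshold differs between seeds.
    gain = t.impurity[0] - (n[left] * t.impurity[left]
                            + n[right] * t.impurity[right]) / n[0]
    root_splits[(iris.feature_names[t.feature[0]],
                 round(float(t.threshold[0]), 2),
                 round(float(gain), 4))] += 1

print(root_splits)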
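On the docstring question: an easy way to confirm that the variability really comes from this parameter is that fixing random_state on the estimator itself makes the fitted tree fully reproducible, e.g.:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=123,
    shuffle=True, stratify=iris.target)

# With a fixed random_state, all 20 refits give the identical score.
scores = {DecisionTreeClassifier(random_state=0)
          .fit(X_train, y_train)
          .score(X_test, y_test)
          for _ in range(20)}
print(scores)  # a single value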