Hmmm that’s weird...

Have you tried plotting the trees (the decision rules) for different seeds, 
to see whether the gain for the first split is the same even when the split 
itself differs?

I’d at least try that before diving into the source code...
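
For instance, a rough sketch (untested; it reads the root split's weighted 
impurity decrease straight off the fitted `tree_` arrays):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for seed in range(5):
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    t = tree.tree_
    left, right = t.children_left[0], t.children_right[0]
    n = t.n_node_samples
    # gain of the root split: parent impurity minus the
    # sample-weighted impurity of its two children
    gain = t.impurity[0] - (n[left] * t.impurity[left]
                            + n[right] * t.impurity[right]) / n[0]
    print(seed, t.feature[0], round(float(t.threshold[0]), 3),
          round(float(gain), 4))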

Cheers,

--
Julio

> On Oct 28, 2018, at 2:24, Sebastian Raschka <m...@sebastianraschka.com> 
> wrote:
> 
> Thanks, Javier,
> 
> however, max_features is n_features by default. But if you execute 
> something like
> 
> import numpy as np
> from sklearn.datasets import load_iris
> from sklearn.model_selection import train_test_split
> from sklearn.tree import DecisionTreeClassifier
> 
> iris = load_iris()
> X, y = iris.data, iris.target
> X_train, X_test, y_train, y_test = train_test_split(X, y,
>                                                    test_size=0.3,
>                                                    random_state=123,
>                                                    shuffle=True,
>                                                    stratify=y)
> 
> for i in range(20):
>     # random_state is left at None, so each fit may build a different tree
>     tree = DecisionTreeClassifier()
>     tree.fit(X_train, y_train)
>     print(tree.score(X_test, y_test))
> 
> 
> 
> You will find that the tree produces different results if you don't fix 
> the random seed. Related to what you said about random feature selection 
> when max_features is not n_features: I suspect there is generally some 
> sorting of the features going on, and the different trees are then due to 
> tie-breaking when two features have the same information gain?
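> 
> As a quick sanity check on the tie-breaking idea (just a sketch, reusing 
> the X_train/y_train split from above): if ties are resolved via the RNG, 
> fixing random_state should make the loop deterministic:
> 
> for i in range(20):
>     tree = DecisionTreeClassifier(random_state=0)  # fixed seed
>     tree.fit(X_train, y_train)
>     print(tree.score(X_test, y_test))  # same score every iteration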
> 
> Best,
> Sebastian
> 
> 
> 
>> On Oct 27, 2018, at 6:16 PM, Javier López <jlo...@ende.cc> wrote:
>> 
>> Hi Sebastian,
>> 
>> I think the random state is used to select the features that go into each 
>> split (look at the `max_features` parameter).
>> 
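>> For illustration (my own toy sketch, not from the docs): with 
>> max_features smaller than n_features, the splitter samples a feature 
>> subset at every node, so the seed visibly changes the tree:
>> 
>> from sklearn.datasets import load_iris
>> from sklearn.tree import DecisionTreeClassifier
>> 
>> X, y = load_iris(return_X_y=True)
>> for seed in (0, 1, 2):
>>     # only 2 of the 4 iris features are considered at each split
>>     tree = DecisionTreeClassifier(max_features=2, random_state=seed)
>>     tree.fit(X, y)
>>     print(seed, tree.tree_.feature[0], tree.tree_.threshold[0])
>> 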
>> Cheers,
>> Javier
>> 
>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka 
>> <m...@sebastianraschka.com> wrote:
>> Hi all,
>> 
>> when I was implementing a bagging classifier based on scikit-learn's 
>> DecisionTreeClassifier, I noticed that the results were not deterministic 
>> and found that this was due to the random_state in the 
>> DecisionTreeClassifier (which is set to None by default).
>> 
>> I am wondering what exactly this random state is used for? I can imagine 
>> it being used for resolving ties if the information gain for multiple 
>> features is the same, or it could be that the splits of continuous 
>> features come out differently? (I thought the heuristic is to sort the 
>> feature values and to consider thresholds between adjacent values that 
>> are associated with examples of different class labels -- but is there 
>> maybe some random subselection involved?)
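>> 
>> To make my mental model concrete (a toy sketch; whether scikit-learn 
>> restricts candidates to class-boundary pairs is exactly what I'm unsure 
>> about), I picture candidate thresholds for one continuous feature as 
>> midpoints between distinct adjacent sorted values:
>> 
>> import numpy as np
>> 
>> x = np.array([1.0, 1.5, 1.5, 2.0, 3.0])  # one continuous feature
>> xs = np.sort(x)
>> # midpoints between adjacent values, keeping only distinct pairs
>> mid = (xs[:-1] + xs[1:]) / 2.0
>> print(mid[xs[:-1] != xs[1:]])  # [1.25 1.75 2.5 ]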
>> 
>> If someone knows more about this, where the random_state is used, I'd be 
>> happy to hear it :)
>> 
>> Also, we could then maybe add the info to the DecisionTreeClassifier's 
>> docstring, which is currently a bit too generic to be useful, I think:
>> 
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>> 
>> 
>>    random_state : int, RandomState instance or None, optional (default=None)
>>        If int, random_state is the seed used by the random number generator;
>>        If RandomState instance, random_state is the random number generator;
>>        If None, the random number generator is the RandomState instance used
>>        by `np.random`.
>> 
>> 
>> Best,
>> Sebastian
> 
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
