Hello, I'm a data science instructor who uses scikit-learn extensively in the classroom. Yesterday I was teaching decision trees, and I summarized the tree-building process (for regression trees) as follows:
1. Begin at the top of the tree.
2. For every feature, examine every possible cutpoint, and choose the feature and cutpoint such that the resulting tree has the lowest possible mean squared error (MSE). Make that split.
3. Examine the two resulting regions, and again make a single split (in one of the regions) to minimize the MSE.
4. Keep repeating step 3 until a stopping criterion is met.

One question that came up is why there is a random_state parameter for a DecisionTreeRegressor (or a DecisionTreeClassifier). Assuming that an exhaustive search is performed before each split (meaning that every possible cutpoint is checked for every feature), it is not obvious to me what randomness is used during the tree-building process that would make a random_state necessary. My best guesses were that the random_state is used for tie-breaking, or perhaps that the search for the best split is not exhaustive, in which case random_state would affect how the search is performed.

In summary, I am asking: Why is a random_state necessary for decision trees? As a corollary, I am asking: Am I correctly representing how a decision tree is built?

Thank you very much!

Kevin Markham
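P.S. To make the tie-breaking guess concrete, here is a minimal sketch I put together. It uses two duplicated features, so every candidate split on f0 reduces the MSE by exactly as much as the same split on f1 and the "best" split is tied across features. The feature names f0/f1 are just labels I made up, and whether the printed trees actually differ across seeds depends on the implementation, which is exactly what I am unsure about:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Two identical copies of one feature: any cutpoint on f0 gives exactly
# the same MSE reduction as the same cutpoint on f1, so the best-split
# search has ties between features.
X = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)
y = np.array([0.0, 0.0, 1.0, 1.0])

for seed in (0, 1, 2):
    tree = DecisionTreeRegressor(random_state=seed).fit(X, y)
    # export_text shows which feature was chosen at the root; with tied
    # splits this could plausibly differ across seeds.
    print("random_state =", seed)
    print(export_text(tree, feature_names=["f0", "f1"]))

(I realize random_state obviously matters when splitter="random" or when max_features is set to fewer than all features; my question is about the default exhaustive search over all features.)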