Hello,

I'm a data science instructor who uses scikit-learn extensively in the
classroom. Yesterday I was teaching decision trees, and I summarized the
tree-building process (for regression trees) as follows:

1. Begin at the top of the tree.
2. For every feature, examine every possible cutpoint, and choose the
feature and cutpoint such that the resulting tree has the lowest possible
mean squared error (MSE). Make that split. (See the sketch after this list.)
3. Examine the two resulting regions, and again make a single split (in one
of the regions) to minimize the MSE.
4. Keep repeating step 3 until a stopping criterion is met.
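
For concreteness, here is the toy sketch I used for step 2. This is my own
pure-Python illustration of an exhaustive best-split search for a single
node, not scikit-learn's actual (Cython) implementation:

import numpy as np

def best_split(X, y):
    """Exhaustively check every feature and every candidate cutpoint,
    returning the (feature, threshold) pair that minimizes the weighted
    MSE of the two resulting regions."""
    best = (None, None, np.inf)  # (feature index, threshold, weighted MSE)
    n_samples, n_features = X.shape
    for j in range(n_features):
        # Candidate cutpoints: midpoints between consecutive sorted values
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            # The MSE of a region is the variance of y within that region
            mse = (len(left) * left.var() + len(right) * right.var()) / n_samples
            if mse < best[2]:
                best = (j, t, mse)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 1.2, 3.1, 3.0])
print(best_split(X, y))  # (0, 2.5, 0.00625): splits between 2.0 and 3.0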

One question that came up is why there is a random_state parameter for a
DecisionTreeRegressor (or a DecisionTreeClassifier). Assuming that an
exhaustive search is performed before each split (meaning that every
possible cutpoint is checked for every feature), it is not obvious to me
what randomness is used during the tree-building process, or why a
random_state would be needed at all.

My best guesses were that random_state is used for tie-breaking, or
perhaps that the search for the best split is not exhaustive and thus
random_state affects how the search is performed.
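
To probe the tie-breaking guess, I sketched the quick experiment below. The
data construction is my own: the second feature is an exact copy of the
first, so the two features produce perfectly tied best splits, and any
tie-breaking randomness should show up in which feature gets chosen:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.RandomState(0)
x = rng.rand(100, 1)
X = np.hstack([x, x])  # feature 1 is an exact copy of feature 0
y = (x.ravel() > 0.5).astype(float) + rng.rand(100) * 0.01

# Fit a depth-1 tree with several seeds and print the chosen split
for seed in (0, 1, 2):
    tree = DecisionTreeRegressor(max_depth=1, random_state=seed).fit(X, y)
    print("random_state =", seed)
    print(export_text(tree))

If different seeds pick different (tied) features here, that would support
the tie-breaking hypothesis, but I would rather understand what is actually
happening inside the splitter.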

In summary, I am asking: Why is a random_state necessary for decision trees?

As a related question: Am I correctly representing how a decision
tree is built?

Thank you very much!
Kevin Markham