Re: [Scikit-learn-general] Utility of random_state parameter for decision trees

Dale Smith Fri, 16 Oct 2015 05:52:43 -0700

I am studying Gilles Louppe's dissertation, which contains the best explanation 
I have seen of various properties of tree methods. If you want to know more, I 
would start here:


http://www.montefiore.ulg.ac.be/~glouppe/pdf/phd-thesis.pdf

Dale Smith, Ph.D.
Data Scientist




    


-----Original Message-----
From: Arnaud Joly [mailto:a.j...@ulg.ac.be] 
Sent: Thursday, October 15, 2015 7:29 AM
To: scikit-learn-general@lists.sourceforge.net
Subject: Re: [Scikit-learn-general] Utility of random_state parameter for 
decision trees

Your intuition is correct. For a decision tree with max_features=None, the 
random_state is used to break ties randomly between candidate splits that are 
equally good.
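
To make the tie-breaking concrete, here is a minimal pure-Python sketch of one 
exhaustive split search with a random feature order. It is not scikit-learn's 
implementation: the names sse and best_split and the toy data are hypothetical, 
and the only assumption carried over is that tied splits are resolved by a 
seed-dependent visiting order of the features.

```python
import random


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y, seed):
    """Check every cutpoint of every feature; return (feature, threshold, error).

    Features are visited in a seed-dependent random order, and only a
    strictly better split replaces the current best, so a tie goes to
    whichever tied feature happens to be visited first.
    """
    rng = random.Random(seed)
    features = list(range(len(X[0])))
    rng.shuffle(features)

    best = None
    for j in features:
        values = sorted(set(row[j] for row in X))
        # Candidate cutpoints are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[2]:
                best = (j, threshold, error)
    return best


# Two identical columns: each split on feature 0 ties with the same split
# on feature 1, so the search itself cannot prefer one over the other.
X = [[1, 1], [2, 2], [3, 3], [4, 4]]
y = [1.0, 1.0, 5.0, 5.0]

chosen = {best_split(X, y, seed)[0] for seed in range(20)}
print(chosen)  # set of feature indices selected across the 20 seeds
```

The search is exhaustive and deterministic for a fixed seed; the seed matters 
only when two candidate splits have exactly the same error, as the duplicated 
columns force here.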

Cheers,
Arnaud


> On 14 Oct 2015, at 17:33, Kevin Markham <justmark...@gmail.com> wrote:
> 
> Hello,
> 
> I'm a data science instructor who uses scikit-learn extensively in the 
> classroom. Yesterday I was teaching decision trees, and I summarized the tree 
> building process (for regression trees) as follows:
> 
> 1. Begin at the top of the tree.
> 2. For every feature, examine every possible cutpoint, and choose the feature 
> and cutpoint such that the resulting tree has the lowest possible mean 
> squared error (MSE). Make that split.
> 3. Examine the two resulting regions, and again make a single split (in one 
> of the regions) to minimize the MSE.
> 4. Keep repeating step 3 until a stopping criterion is met.
> 
> One question that came up is why there is a random_state parameter for a 
> DecisionTreeRegressor (or a DecisionTreeClassifier). Assuming that an 
> exhaustive search is performed before each split (meaning that every possible 
> cutpoint is checked for every feature), it is not obvious to me what 
> randomness is used during the tree building process, such that a random_state 
> is necessary.
> 
> My best guesses were that the random_state is used for tiebreaking, or 
> perhaps that the search for the best split is not exhaustive and thus 
> random_state affects the way in which the search is performed.
> 
> In summary, I am asking: Why is a random_state necessary for decision trees?
> 
> As a corollary, I am asking: Am I correctly representing how a decision tree 
> is built?
> 
> Thank you very much!
> Kevin Markham


------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
