Andy and Arnaud: Thank you for the answers, and also for your work on this excellent library!
Dale: Thank you for the pointer!

Best,
Kevin

On Fri, Oct 16, 2015 at 8:51 AM, Dale Smith <dsm...@nexidia.com> wrote:
> I am studying Gilles Louppe's dissertation, which contains the best
> explanation of various properties of tree methods. If you want to know
> more, I would start here.
>
> http://www.montefiore.ulg.ac.be/~glouppe/pdf/phd-thesis.pdf
>
> Dale Smith, Ph.D.
> Data Scientist
>
> d. 404.495.7220 x 4008  f. 404.795.7221
> Nexidia Corporate | 3565 Piedmont Road, Building Two, Suite 400 | Atlanta, GA 30305
>
> -----Original Message-----
> From: Arnaud Joly [mailto:a.j...@ulg.ac.be]
> Sent: Thursday, October 15, 2015 7:29 AM
> To: scikit-learn-general@lists.sourceforge.net
> Subject: Re: [Scikit-learn-general] Utility of random_state parameter for decision trees
>
> Your intuition is correct. For a decision tree with max_features=None, the
> random_state is used to break ties randomly.
>
> Cheers,
> Arnaud
>
> > On 14 Oct 2015, at 17:33, Kevin Markham <justmark...@gmail.com> wrote:
> >
> > Hello,
> >
> > I'm a data science instructor who uses scikit-learn extensively in the
> > classroom. Yesterday I was teaching decision trees, and I summarized the
> > tree-building process (for regression trees) as follows:
> >
> > 1. Begin at the top of the tree.
> > 2. For every feature, examine every possible cutpoint, and choose the
> > feature and cutpoint such that the resulting tree has the lowest possible
> > mean squared error (MSE). Make that split.
> > 3. Examine the two resulting regions, and again make a single split (in
> > one of the regions) to minimize the MSE.
> > 4. Keep repeating step 3 until a stopping criterion is met.
> >
> > One question that came up is why there is a random_state parameter for a
> > DecisionTreeRegressor (or a DecisionTreeClassifier). Assuming that an
> > exhaustive search is performed before each split (meaning that every
> > possible cutpoint is checked for every feature), it is not obvious to me
> > what randomness is used during the tree-building process, such that a
> > random_state is necessary.
> >
> > My best guesses were that the random_state is used for tie-breaking, or
> > perhaps that the search for the best split is not exhaustive and thus
> > random_state affects the way in which the search is performed.
> >
> > In summary, I am asking: why is a random_state necessary for decision trees?
> >
> > As a corollary, I am asking: am I correctly representing how a decision tree is built?
> >
> > Thank you very much!
> > Kevin Markham
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
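[Editor's note] Step 2 of Kevin's summary (the exhaustive search over every feature and cutpoint) can be sketched as a toy function. This is not scikit-learn's actual implementation — the library does this in optimized Cython — just a minimal illustration of scoring each (feature, cutpoint) pair by the weighted MSE of the two resulting regions:

```python
import numpy as np

def best_split(X, y):
    """Toy exhaustive search: return (feature, threshold, mse) of the best split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cutpoints: midpoints between consecutive sorted unique values.
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # Weighted MSE of the two regions, each predicting its own mean.
            mse = (left.var() * len(left) + right.var() * len(right)) / len(y)
            if mse < best[2]:
                best = (j, t, mse)
    return best

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])
print(best_split(X, y))  # the cut at 2.5 yields two pure regions (MSE 0)
```

Steps 3 and 4 would then apply `best_split` recursively to each resulting region until a stopping criterion (e.g. minimum samples per leaf) is met.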
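[Editor's note] Arnaud's answer — that with max_features=None the random_state only breaks ties — can be demonstrated with a small sketch. The setup below is hypothetical (not from the thread): duplicating a feature makes the best split exactly tied between two columns, so which column the tree picks can vary with the seed, while a fixed seed is reproducible:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
x = rng.rand(100)
X = np.column_stack([x, x])           # feature 1 duplicates feature 0 -> tied splits
y = (x > 0.5).astype(int)

# Which feature ends up at the root can differ across seeds,
# since both features produce an identical impurity decrease.
roots = {DecisionTreeClassifier(random_state=seed).fit(X, y).tree_.feature[0]
         for seed in range(20)}
print(sorted(roots))

# Fixing random_state makes the tie-break (and the whole tree) reproducible.
a = DecisionTreeClassifier(random_state=0).fit(X, y)
b = DecisionTreeClassifier(random_state=0).fit(X, y)
assert a.tree_.feature[0] == b.tree_.feature[0]
```

Note that no particular pair of seeds is guaranteed to disagree; the point is only that the choice among exactly-tied splits is seed-dependent, which is why even the exhaustive (max_features=None) splitter accepts a random_state.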